ABSTRACT

Although there are fundamental differences in the objectives of the
two activities, the programming of instructional material bears
many similarities to the construction of tests. A systematic
comparison of problems and procedures reveals important implications
for programming from the older field of testing. Theory and experience
in test construction can be especially useful in the selection of
valid criteria for assessing the effectiveness of a program, the
ordering of instructional subject matter, the writing of instructional
frames, and the formal evaluation of the program. Adaptive programming
implies measurement of both aptitude and achievement in order to
assign trainees to appropriate individual sequences of instruction.
Possible applications resulting from examination of these and other
issues are explored, and necessary further research is suggested.
Section 1. Introduction
Purpose and Plan of This Report



Almost every person in America has had some contact with tests. People
are given tests in school to determine what may be expected of them and what
they have already learned; they are given tests so that employers can select
the best qualified applicants; they are tested for voter registration, for
driver's licenses, etc. Because of this first-hand experience in test-taking,
most people have some ideas, correct and incorrect, about what tests are, the
purpose of using tests, the value of tests, and their limitations.

In recent years psychologists and educators have paid a great deal of
attention to printed materials and to devices which seem to closely resemble
tests, but which are called auto-instructional programs, teaching machines,
self-instructional devices, automated tutors, etc., and which are used for
quite a different purpose than that for which tests are used. We will call
these printed materials auto-instructional programs, or, simply, programs.
Most people have probably not had first-hand experience with programs, although
they may have read about them in newspapers and magazines.

The purpose of this report is to examine the extent to which what we know
about tests can be applied to the development of programs. Because this
purpose is a relatively restricted one, no attempt has been made to deal with
either testing or programming in a comprehensive way; the emphasis is on
applications of testing to programming.

The report is intended primarily for people now engaged in training and
programming activities. For the benefit of those readers who are just becoming
acquainted with programming, this section and Section 2 provide general
background information on testing and programming. Sections 3 and 4, which
constitute the main body of the report, deal with how we can use what we know
about testing when we set out to construct a program. Section 5 provides a
summary of the report.
Tests and Programs: Similarities, Differences, and Relationships

We use tests to help us decide how to sort out or classify people. We
may want to sort them into those who will succeed in college and those who
will not; those who should be given a driver's license, those who should be
given a license subject to certain restrictions, and those who should not
be given a license; those who should be given school grades of "A," of "B,"
of "C," of "D," and of "F"; etc. In general, tests tell us what people do
in certain situations (on test questions or "items"), and we may use this
information for such purposes as predicting what they will do in other
situations.

On the other hand, we use auto-instructional programs to teach people,
or, as a psychologist might say, to modify their behavior so that their
performance on some class of tasks is different because of having been through
the program. We must keep in mind this fundamental difference in reasons for
using tests and for using programs: we use tests to measure present behavior
so that we can predict future behavior, while we use auto-instructional
programs to modify or produce future behavior.

While tests and programs do differ in what they are used for, there are
certain similarities and relationships between tests and programs which
suggest that it would be worthwhile to look at our accumulated knowledge of
testing to see what implications for programming might exist.

To begin with, tests and programs are similar in appearance. Both tests
and programs basically consist of sets of questions. The technical term for
a test question is an item, while the technical term for a program question
is a frame. It is very frequently difficult to tell from inspection whether
an individual question is meant to be a test item or an auto-instructional
frame. Which of these questions do you think are meant to be items and which
do you think are meant to be frames?*

6. You have purchased 7 chances in a lottery for a new car. A total of
10,000 chances were sold. What is the possibility that you might win?

6. What is the sum of 7 and 3?

8. A precursor of Vitamin A is a ______.



Often a test question is accompanied by some expository material, that
is, some information not in the question itself which the examinee will need
in order to answer it. An auto-instructional frame may also be accompanied
by expository material. It is difficult, therefore, to tell from the
presence or absence of expository material whether a question is meant to be
a test item or an instructional frame.

While it may be difficult to tell whether an isolated question is meant
to be a test item or auto-instructional frame, the context in which the
question occurs will probably enable one to make this distinction. An
auto-instructional frame will usually be preceded by other frames which "lead
up to it" and make it easier to answer it. A test item, however, will usually
stand by itself; the test constructor will try to avoid having other items
in the same test which help the examinee answer a given item.

* All three questions are frames. From Barlow, Calvin, and Glaser,
respectively, as reproduced in Rigney and Fry (ref. 111).

An additional clue as to whether a given question is meant to be a test
item or an auto-instructional frame is this: after the learner responds to
an auto-instructional frame he will usually be exposed to the correct answer.
In this way he can find out whether the answer he gave was correct, that is,
he can receive knowledge of results. After the examinee responds to a test
question, he usually does not receive knowledge of results in those tests
that are currently available. This does not mean, however, that providing
the examinee with knowledge of results would necessarily defeat the
measurement purpose of using the test, or that future tests will not provide
knowledge of results to the examinee. In fact, Severin (ref. 119), Pask
(ref. 104), and Pressey (ref. 106) have suggested that under certain
circumstances tests which provide knowledge of results can do a better job
of measurement.

Suppose we found ourselves in the improbable situation, after looking at
a set of questions, of not being sure whether it constituted a test designed
to measure people's behavior or an auto-instructional program designed to
modify people's behavior. Could we not just use the set of questions in order
to see how it functioned?

Since we define a program as a set of questions which serves to modify
behavior, perhaps the simplest way to collect data in order to classify an
unknown set of questions as a test or program is to administer the set of
questions twice to the same group of people. If people get substantially
more questions right the second time than they do the first time, the set of
questions has modified behavior and may be called a program. Numerous studies
have shown, however, that when people are repeatedly given sets of questions
that are already known to be tests (they have already been successfully used
to measure and predict behavior), they get more questions right each
succeeding time.
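
The two-administration check described above can be sketched in a few lines.
This is only an illustration under stated assumptions: the function name
mean_gain, the sample scores, and the cutoff for a "substantial" gain are all
hypothetical and do not come from the report.

```python
from statistics import mean


def mean_gain(first_scores, second_scores):
    """Mean per-person gain in number right between two administrations."""
    if len(first_scores) != len(second_scores):
        raise ValueError("each person needs a score on both administrations")
    return mean(b - a for a, b in zip(first_scores, second_scores))


# Hypothetical number-right scores for five people on the same question set.
first = [12, 15, 9, 20, 14]
second = [16, 18, 13, 21, 17]

# Assumed cutoff: the report does not say how large a gain is "substantial".
SUBSTANTIAL_GAIN = 2.0

gain = mean_gain(first, second)
print(f"mean gain: {gain:.1f}")
print("behavior was modified" if gain >= SUBSTANTIAL_GAIN else "no substantial gain")
```

As the paragraph notes, a gain of this kind also appears with sets of
questions already known to be tests, so the check by itself does not separate
tests from programs.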

A second way we might collect data in order to classify a given set of
items as a test or a program is to see whether the earlier questions affect
performance on the later questions. If so, then presumably the set of
questions constitutes a program. Unfortunately, certain studies have shown
that for sets of questions that are already known to be tests, the presence
of some questions affects the performance on others.

The two possibilities we have considered above for collecting data in
order to be able to classify a set of questions as a test or a program will
not work. In each case sets of questions which are designed to be tests
and which are useful as tests have certain characteristics which we might
expect only programs to have.* Furthermore, tests have these characteristics
when they are administered without providing the examinee with knowledge of
results (without telling him whether he is right after each response).

* We shall see in Section 4 that sets of questions which are designed to
be programs and which are useful as programs may have certain characteristics
which we might expect only tests to have.
It appears, then, that although a test constructor may intend his set
of questions to merely measure behavior, the actual use of his questions
will often (if not always) modify behavior also, and the test constructor
will find himself with a program on his hands. If he wanted to construct
a program, he might, of course, proceed differently than if he wanted to
construct a test. To take one obvious procedural difference, in constructing
a program he would arrange for the learner to receive knowledge of results
following his responses. But questions which are not accompanied by
knowledge of results may also have an instructional function, that is, they
may also modify behavior. An interesting example of this has recently been
reported by Estes (ref. 36, p. 220). He compared the performance of two
groups of Ss who were repeatedly given a set of questions under the
knowledge of results (L) and no knowledge of results (T) conditions shown
below:

Group 1    L  L  T  T

Group 2    L  T  L  T  T

Time --->

Note that the two groups differ only in that Group 2 was given the set
of questions without knowledge of results one additional time. Yet this
additional opportunity, which one would expect to have only a testing
function, had also an instructional function: Group 2 showed 78% retention
while Group 1 showed only 52% retention.

All of the preceding discussion may be summarized as follows: not only
is it difficult to distinguish between tests and programs upon inspection,
but it is also difficult to distinguish between them by collecting data.
Questions may both measure and modify behavior (at the same time), even if
they are intended merely to measure behavior. It is reasonable to expect,
therefore, that our knowledge of test construction will be useful in program
construction.

An additional similarity between testing and programming suggests that
our knowledge of test construction will be useful in program construction.
This similarity is in the general steps one takes or should take in the
conception, construction, and evaluation of tests and auto-instructional
programs. For both tests and programs, these stages may be labelled
Specifying Objectives, Determining the Resources Available, Planning and
Developing Items (Frames), Pretesting and Revision, Evaluation, and
Providing Information to Test (Program) Users.*

* In Section 2 these stages will be elaborated upon for auto-instructional
programming and in Section 3 they will be elaborated upon for testing.

In addition to the similarities between testing and programming which
have been noted above, there are two general relationships between testing
and programming which also suggest that test construction knowledge will be
useful to the programmer. It will be helpful to look briefly at one way of
subdividing the category "tests" and one way of subdividing the category
"programs" before discussing these relationships.*

A very common type of test is the achievement test. The achievement
test is used to determine how much has been learned. One use of achievement
tests in school settings is to see whether each student has mastered the
material he needs in order to master material in higher grades. One may
also use achievement tests in school settings to see how much groups of
students have learned under different teaching methods. In the first case,
one is interested in making a decision about students, and in the second
case, one is interested in making a decision about teaching methods. In
both cases, the achievement test gives information on how much has been
learned.

Achievement tests may also be used in two ways in conjunction with
auto-instructional programming. We may want to find out whether a student has
mastered the material to which he has been exposed so that we then can put
him in a position to utilize what he has mastered, either in further formal
instruction or in a job situation. Or we may want to see how well
auto-instructional teaching compares with, say, classroom lecture and
discussion teaching. Whenever we use auto-instructional programming we are
concerned with both of these questions, and the key to answering them lies in
the use of an achievement test. For this reason knowledge about achievement
testing is important in the construction and evaluation of auto-instructional
programming.

We have focused our attention on a particular type of test, the
achievement test, and we have consequently seen that there was an important
relationship between testing and programming. We will now focus attention on
a particular type of auto-instructional program, and will uncover another
important relationship between testing and programming.

One way that auto-instructional programming may be more effective than
conventional instructional modes (such as traditional classroom teaching) is
that with auto-instructional programming all students need not be presented
with the same sequence of material. Students differ in how much they initially
know of a given subject matter, in how quickly they can acquire new knowledge
and skills, and in the extent of their misconceptions. A teacher who
instructs many students simultaneously may not be able to provide just the
right amount and sequence of material for the most efficient learning of each
student. So the teacher may decide (consciously or unconsciously) upon a
sequence of material which he hopes will be both adequate for the slower
learners to grasp the essentials and interesting enough to keep the faster
learners from becoming bored. He is often unsuccessful; the pace may be too
fast for some students and too slow for others; misconceptions may be cleared
up for some, while others may become confused. We may expect this to happen
not only when one teacher provides instruction for many students
simultaneously, but also when one textbook or film or auto-instructional
program provides instruction for many students simultaneously.

* Other ways of subdividing these categories will be mentioned later in
this report as the distinctions are needed. Cronbach states, "Although some
scheme of classifying tests is a convenience, all such divisions are
arbitrary. One of the striking trends is the breakdown of traditional
division lines" (ref. 21). For a discussion of "types" of programs, see
Rigney and Fry (ref. 111).

For this reason, many people who develop auto-instructional programs
in order to make learning more efficient have been interested in providing
each individual learner with a sequence of material tailored to fit his
particular needs and abilities, or at least in approximating this condition
insofar as limitations in devices, developmental costs, and operational
costs allow. The programmer may provide some learners with more frames,
with different frames, or with a different ordering of frames than other
learners. In general, when the programmer does not provide each learner
with an identical sequence of material, we will refer to this as adaptive
programming. At this point, the significance of adaptive programming is
this: in order to assign different learners to different sequences of
materials, we need to first have some information on the basis of which we
can classify them. If we wish to use rather different sequences of material
for "fast" and "slow" learners, we must first know who the "fast" and "slow"
learners are. If we wish to provide supplementary information to correct a
misconception, we must first know which students have this misconception.
In general, adaptive programming involves getting some information about the
learner so that we can give him material especially suited to him. The
procedure of getting this information is, of course, a testing procedure.
When adaptive programming is used, therefore, there is an additional
relationship between testing and programming.
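
The dependence of adaptive programming on prior testing can be made concrete
with a small sketch. Everything named here is an assumption introduced for
illustration: the Learner record, the pretest cutoff of 70, the frame labels,
and the "sign_error" misconception are hypothetical and are not taken from
any program discussed in this report.

```python
from dataclasses import dataclass, field


@dataclass
class Learner:
    name: str
    pretest_score: int  # hypothetical pretest score, 0-100
    misconceptions: set = field(default_factory=set)


# Hypothetical frame sequences; the report names no specific ones.
MAIN_SEQUENCES = {
    "fast": ["F1", "F2", "F3"],
    "slow": ["S1", "S2", "S3", "S4", "S5"],
}
REMEDIAL_FRAMES = {"sign_error": ["R1", "R2"]}


def assign_sequence(learner, fast_cutoff=70):
    """Choose a frame sequence from information gathered by testing."""
    track = "fast" if learner.pretest_score >= fast_cutoff else "slow"
    frames = []
    # Supplementary frames come first for each misconception we can treat.
    for misconception in sorted(learner.misconceptions):
        frames.extend(REMEDIAL_FRAMES.get(misconception, []))
    frames.extend(MAIN_SEQUENCES[track])
    return frames


print(assign_sequence(Learner("A", 85)))
print(assign_sequence(Learner("B", 40, {"sign_error"})))
```

Note that assign_sequence cannot run until a pretest score and a list of
misconceptions exist for the learner, which is the point of the paragraph
above: the classification step is itself a testing procedure.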
Section 2. Procedures in Programming



Auto-instructional programming is relatively new, and there are still
many unanswered questions associated with it.* There is no general
agreement, for instance, as to whether the student should be prevented from
making wrong responses, or whether the student should be required to make
any overt responses. There is no agreement on whether a single program is
best for all students. The programmers who appear to represent different
"schools" or "philosophies" of programming (e.g., Skinner, Crowder, Pressey)
differ in how explicit they are in describing what steps they follow in
developing a program. They also differ in how explicit they are about what
they do at each of these steps.

For these reasons it is impossible to present a general yet accurate
picture either of how programs are written or of how programs should be
written. Instead, we will list and discuss the steps in what, at the moment,
appears to be an over-complete and idealized procedure for developing a
program.

Stating one's objectives has been mentioned as a first step by Glaser
(ref. 57), Holland (ref. 70), Skinner (ref. 123), Stolurow (ref. 127), and
others. It has not been mentioned by Crowder (ref. 27) in his expositions
of his "intrinsic" programming philosophy and technique.

What is meant by a statement of objectives and why is it needed?
Several writers have suggested that a statement of objectives should be a
set of items that the programmer wants the students to be able to pass.
"One might think of these as answers to questions which might appear on a
final examination for a given course" (ref. 15, p. 550). When one person
commissions another to develop an auto-instructional program, he must, of
course, state what he wants the program to accomplish. In order for this
statement to be useful, it must be specific.

Statements such as "proficiency in electronic troubleshooting" and
"facility with symbolic logic" are not sufficiently specific. The programmer
must state just what it is that the person who is proficient in electronic
troubleshooting or facile with symbolic logic can do.** Skinner, for
example, has analyzed and further specified the "ability to read" as follows:

"...a child reads or 'shows that he knows how to read' by
exhibiting a behavioral repertoire of great complexity.
He finds a letter or word in a list on demand; he reads
aloud; he finds or identifies objects described in a text;
he rephrases sentences; he obeys written instructions; he
behaves appropriately to described situations; he reacts
emotionally to described events; and so on, in a long list"
(ref. 124, p. 383).



* Some general sources of information on programmed instruction are
Kopstein and Shillestad (ref. 88), Lumsdaine and Glaser (ref. 95), Rigney
and Fry (ref. 111), and Stolurow (ref. 127).

** For further discussion of how to specify objectives, see Mager
(ref. 97).
If the programmer is developing the program on his own behalf and has
previously taught the subject matter, he may feel that the writing down of
his objectives is unnecessary, and that, if it must be done, it might be done
more easily after the program is written. There is no evidence that a
teacher who knows what he wants to teach will do a better job of programming
the material after writing down his objectives. But certain logical
considerations suggest that it may be a good idea for him to do so.

One of these considerations is that the next step in program construction
depends on the prior step of stating one's objectives. This step is to find
out where the students stand now, that is, to find out how competent they
initially are in the directions desired. One way of getting this information
is to administer a test of subject matter achievement to them, that is, a
test of the proficiency which the program will be designed to develop. As we
shall see again in the next section, the programmer can only construct such
a test after he first states his instructional objectives.

Once the programmer has decided what is to be taught and has found out
what the students already know that is relevant, the next steps are to
organize the subject matter, to spell out the relationships among the
components of the subject matter, to specify what must be taught before what,
and to decide upon a method of programming which is best suited to the
particular material. Among the available methods are Skinnerian programming
(refs. 123, 125), intrinsic programming (ref. 27), and Ruleg programming
(ref. 40). Samples of these and other methods or "styles" of programming
are given by Rigney and Fry (ref. 111).

When the programmer has organized the subject matter and decided upon
a suitable style of programming, he can then start the actual writing of the
program. General suggestions for program writing are offered by Gilbert
(ref. 55).*

According to Rigney and Fry (ref. 111), the program writer does not
have an easy or routine task. He does not (or should not) merely add detail
to an already existing textbook organization, break down the printed matter
into one or two sentence segments called frames, and then delete one word
at random from each frame and replace it with a blank to be filled in by
the learner. Rather, he must first organize the material in a way that
seems best, considering what the learner can initially do and what he wants
the learner ultimately to be able to do. Then the programmer must devise
expository material and questions which are meant to have specific functions
in getting the learner where he wants him. Above all, his writing of the
program must be sensitive to feedback from the learner at all times. This
means that he must repeatedly present the program to learners, in order to
find out whether they are, in fact, learning what he wants them to learn.**
Generally, this involves trying out sets of frames on learners as they are
written, or perhaps after some editorial revision by subject matter experts
or by colleagues of the programmer.

* Taber, Glaser, and Schaefer (ref. 129) are preparing an extensive
treatment of program writing.

** Neither the writings of Crowder (ref. 27) nor of others who have also
used intrinsic programming and scrambled book format (e.g., Gorow, ref. 60;
Lawson, ref. 90) are very explicit on whether the provisional frames written
by the programmer are tried out on students.

After the programmer tries out his provisional frames, his next step
is to revise them on the basis of the information he gets from the tryout.
He may revise particular frames because students indicate they cannot
understand what is required of them, because students cannot do what is
required of them, because the frames do not provide enough background for the
student to handle later frames, etc. The programmer then tries out the revised
(and probably expanded) set of frames. He may repeat this tryout and revision
process several times before he is ready to evaluate the program in a more
formal manner.

For the formal evaluation the programmer works with a somewhat larger
group of students than before. Ideally, this group is representative in
both abilities and motivation of the still larger group for which the
program is intended. If the programmer merely wants to know how well the
program teaches, he sends the students through it, and then administers an
achievement test (a "posttest") to the students. The achievement test,
which measures how competent the students are in the area for which the
program was designed, will have been developed as a by-product of the steps
of specifying objectives and measuring the learners' initial competence.
The programmer will then be able to make a statement of the form "When
students having these (specified) characteristics go through my program
they are then able to get a mean score of ____ with a range of scores
from ____ to ____ on this achievement test."
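
Filling in the blanks of such a statement is a matter of simple descriptive
statistics. The sketch below is illustrative only: the function name
posttest_summary and the five scores are hypothetical, and no data from an
actual program evaluation are implied.

```python
from statistics import mean


def posttest_summary(scores):
    """Return the mean and the lowest and highest posttest scores."""
    if not scores:
        raise ValueError("at least one posttest score is required")
    return {"mean": mean(scores), "low": min(scores), "high": max(scores)}


# Hypothetical posttest scores for a tryout group of five students.
scores = [72, 85, 90, 64, 79]
summary = posttest_summary(scores)

print(
    f"mean score of {summary['mean']:.1f} with a range of scores "
    f"from {summary['low']} to {summary['high']}"
)
```

The statement is only as informative as the description of the students that
accompanies it, so in practice the summary would be reported together with
the specified characteristics of the tryout group.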

In general, however, the programmer will be concerned not merely with
how well the program teaches, but with how well it teaches relative to how
well some alternate presently used method teaches, and relative also to the
cost of each of these methods. He may, therefore, directly compare the
program with the presently used method.

A final step in evaluation, one which programmers have not yet taken,
is to determine how well the program teaches students whose characteristics
(such as aptitudes, abilities, etc.) are known to be different from the
characteristics of the group of students on whom the program was originally
evaluated. One approach to determining this will be discussed in Section 3
under the heading "Providing Information to Test (Program) Users."
Section 3. Steps in Test Construction and
Their Implications for Programming

In this section we will see what the steps are in the construction
of tests, how these steps are related to the steps in the construction of
programs, and what implications we can derive for the construction of
programs.

Specifying Objectives 

When a test constructor sets out to build a test for his own use, or
at the request and for the use of some other person, his first step is to
spell out the specific purpose of the test. In Section 1 we saw that tests
are used to classify people. The test constructor must, therefore, specify
who the people are that he is interested in, and what his purpose is in
classifying them.

The basic reason for specifying who the people are that he is
interested in is that a test which is useful with one population is, in
general, not equally useful with other populations. A test consisting of
questions on arithmetic facts (e.g., Three times four equals ____?) may be
useful in sorting out "more competent in arithmetic" from "less competent
in arithmetic" third graders, but may not be useful in sorting out "more
competent in arithmetic" from "less competent in arithmetic" college
students. Furthermore, if the test constructor wants to see if an already
available test can be used for his purposes, he must look at the population
at which the test is aimed and compare it with his own target population.*

When it comes to programming, we would also expect that a program
which is useful with one population is, in general, not equally useful
with other populations. A program designed for use by high school students
might not be appropriate for use by elementary school students because the
elementary school students do not have the appropriate background, that is,
they do not know the facts, generalizations, concepts, etc., that are needed
to benefit from the program. The same program might not be appropriate for
use with college students because they might already know the facts,
generalizations, and concepts which the program covers.

We do not now have a body of reliable knowledge that we could use in
judging how useful a given program is with a population different from the
one for which the program was designed (see ref. 120). One approach to
this problem is based upon testing considerations, and will be discussed
later in this section. The procedure of using different programs to teach
the same material to different learners will be discussed in a later section.

Once the test constructor has chosen the people he wishes to classify,
he can turn to what his purpose is in classifying them. If he is developing
a selection test, then his purpose is to classify them according to how
adequately they would perform in a given situation. With applicants to
college, it may be whether they would succeed if admitted to college; with
applicants for drivers' licenses, it may be whether they would be good
drivers; with job applicants, whether they would be good machinists, etc.
We will refer to the future performances of examinees which the test
constructor wishes to predict as criterion behaviors, or sometimes,
criteria.

* Thorndike (ref. 131) discusses some considerations in deciding whether
to use existing tests or construct new ones.

In order for the test constructor to proceed with the development of
a selection test he must ultimately specify the criterion behaviors in
operational terms. If he is interested in "success in college" he might
measure this success by grade-point average. He might measure "good
driving" in terms of accident records, and "good machinists" in terms of
amount produced. There may, of course, be several alternate ways for the
test constructor to operationally specify what he is interested in. He
might, for example, choose to measure success in college by disciplinary
record rather than by grade-point average, or by both disciplinary record
and grade-point average.

If, rather than a selection test, the test constructor is developing
an achievement test, he is interested in classifying people according to
how much they have learned. In this case his test defines the criterion
behavior, and again, the definition must be in operational terms.

The reader will recall that a first step in programming is for the
programmer to specify his objectives. This specification should also be
in operational terms, that is, the programmer should state exactly what
it is that he wants the learners to be able to do. These tasks constitute
the criterion behaviors for the program. The nature of test criterion
behaviors and program criterion behaviors is essentially the same, and
some tasks may indeed serve as both. We may think of the difference
between tests and programs in this way: tests are used to predict or define
criterion behaviors, and programs are used to modify criterion behaviors.

What, then, do we know about the choice of criteria for tests that
we can use in choosing criteria for programs?

To begin with, we know that the choice of criteria has a fundamental
importance in selection testing. If a poor selection test is developed,
this will be discovered when it fails to predict the criterion, and, if
the budget allows, a better test may then be prepared. If, however, poor
criteria are selected, there is no opportunity for empirical evaluation,
no feedback from the data, to warn the test constructor.

Similarly, if a poor program is developed, this will be discovered
when the criterion behavior is examined. If, however, poor criteria are
selected, there is no empirical evaluation, no way for the programmer to
discover the inadequacy of the criteria. Whether a given criterion is
relevant to the programmer's purpose ultimately does not depend on
empirical evidence, but rather on a statement by the programmer and/or
the person who requests the production of the program as to what he is
interested in. Severin (ref. 119), for example, working with a correction
procedure and utilizing a Pressey multiple-choice punchboard, was
interested in total number of errors made on a set of items (each item
could contribute more than one error), while Stephens (ref. 126), also
working with a correction procedure and utilizing a Pressey multiple-choice
punchboard, was only interested in the number of errors made by subjects
on their first attempt at each item.



Given that the choice of criteria is extremely important, how can
the test constructor (and the programmer) choose "good" criteria? We come
now to a distinction between ultimate and proximate criteria (ref. 92).
The ultimate criteria are what the test constructor is really interested
in. The ultimate criteria for a scholastic aptitude test might be measures
of the extent to which the student attains the goals of the educational
institution; the ultimate criteria for a driver's license test might be
how safely and courteously the driver manipulates his car in his everyday
driving. Ultimate criteria for an auto-instructional program designed to
teach good citizenship might be measures of the extent to which the learner
uses his opportunities to vote, of how he participates in community affairs,
etc. Proximate criteria are what, for any of a variety of reasons, the
test constructor is willing to settle for. One reason for using proximate
criteria is that it may be too impractical, too costly, to measure the
ultimate criteria. While it would be extremely difficult, or perhaps
impossible, to measure a driver's everyday driving behavior, it is
relatively easy to collapse the frequently encountered driving experiences
such as turning, parking, driving in traffic, etc., into a more-or-less
standardized five-minute road examination.

Another reason for using proximate criteria is that the ultimate
criteria may not be measurable until long after the tester is interested
in measuring them, or perhaps they may never be measurable. While the
ultimate criterion for an instructional program on civil defense may be
whether the learners can perform adequately in a disaster situation if the
occasion arises, the occasion may never arise. For this reason the
proximate criteria of how well learners do in a simulated disaster situation
may be used. While the ultimate criteria for a scholastic aptitude test
may be measures of the extent to which the student attains the objectives
of the educational institution, the proximate criterion which is likely to
be used for convenience is grade-point average.

Still another reason for using proximate criteria rather than ultimate
criteria is that the ultimate criterion behavior of each person may be
different, and so measures of each person's criterion behavior will not
be directly comparable. In such cases, the use of a proximate criterion
may provide a relatively standardized set of tasks on which measures of each
person's behavior will be directly comparable. We may, for example, want
a test which will predict the ultimate criterion of how well a salesman
will sell. We know, however, that a salesman's sales record will depend
upon what territory he is assigned to as well as upon how good a salesman
he is. If we have no good estimates of the sales potentials of different
territories, we may use, as proximate criteria, measures of the salesman's
behavior in an artificially constructed sales situation. We might ask each
salesman questions (e.g., What would you do if the prospect says he wants
the product but doesn't think he can afford it?), or we might see what he
does when confronted with a stooge as a prospective customer.

We can see in this example the usefulness of proximate criteria: if
we want our criteria for a salesman selection test not to be confounded
with the territories to which the salesmen are assigned, we can assign them
all to the same "territory," that is, ask them the same questions, confront
them with the same stooge, etc. At the same time we can see in this
example a danger in the use of proximate criteria: the proximate criteria
may not be related to the ultimate criteria; how well a salesman answers
questions about what he would do in certain situations and what he does
when confronted with a stooge may not be related to how well he can sell.



It seems that both the test constructor and the programmer face a
dilemma in choosing criteria for their tests and programs: the ultimate
criteria are what they are really interested in, but it may be impossible
in practice to obtain measures of them. The proximate criteria may be
convenient and relatively inexpensive to measure, but they may or may not
be related to the ultimate criteria.

How might this dilemma be resolved? In any situation in which
measurement of the ultimate criteria is not possible, one is not forced
to choose between ultimate and proximate criteria but rather one may choose
from among different sets of proximate criteria, any set of which may have
particular advantages and disadvantages. In choosing a set of proximate
criteria we should try to choose a set which we know is related to the
ultimate criteria. When it is impossible to find out whether the proximate
and ultimate criteria are correlated, we might see whether the proximate
criteria are correlated with other proximate criteria. If we cannot do
this, then we must satisfy ourselves that the proximate criteria we choose
to work with are logically related to the ultimate criteria, which is just
another way of saying that the proximate criteria should appear to be
related to the ultimate criteria.

Several instances have been reported in programming in which the
proximate criteria have not reflected the ultimate criteria of amount
learned. Stephens (ref. 126) used the number of errors during training as
a proximate criterion. He found that changing the order of frames and of
multiple-choice alternatives within frames produced more errors during
training, but made no difference on a posttest. Fry (ref. 49) also used
errors during training as a proximate criterion. For one group of learners
he terminated training on a list of paired-associates after two consecutive
errorless runs through the list. A posttest (used as ultimate criterion)
showed that this group learned no more than a group given five
minutes of training during which no member of the group made two
consecutive errorless runs through the list.

Gagné and Dick (ref. 52, p. 40) also found that a proximate criterion
of errors during training did not reflect their ultimate criterion of
transfer:

     "Regardless of the internal criterion measures which were
     employed (number of errors, time to learn), the transfer
     test scores make one reluctant to state that the learning
     program has truly taught 'equation-solving'."

In the above instances the programmers were fortunate in being able
to collect both proximate and ultimate criterion data, and in this way
see the inadequacy of the proximate criteria. But what can the programmer
do to insure a more adequate proximate criterion in situations where
ultimate criterion data cannot be collected?

One point to remember is to avoid what Brogden and Taylor (ref. 10)
call the error of illation. One commits this error when one fails to
distinguish between direct and inferential evidences of the achievement
in which one is interested. Brogden and Taylor cite as an example of the
error of illation the rating of the carpenter's "skillful movements" rather
than the products he turns out. In programming, the evidence cited above
(refs. 49, 52, 126) suggests that under some circumstances measures of
learning during training may provide rather poor inferential evidence of
actual achievement.

In addition to the error of illation, there are various kinds of
bias in criterion selection which both the test constructor and the
programmer should avoid (ref. 10). One is called criterion deficiency:
ignoring some aspect of behavior in which one is actually interested. The
programmer seeks to make learning more efficient, so it is generally
essential for him to measure both amount learned and time taken to learn,
and possibly some other things. When he finds that program A produces
more learning than program B, he also wants to know how the two programs
compare in amount of time taken by the learners. Goldbeck (ref. 58), for
example, found that a comparison of three versions of a program with
regard to amount learned led to different conclusions than did a
comparison of the same three versions with regard to amount learned per
unit time.
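
The point about amount learned versus amount learned per unit time can be
sketched numerically. The following is only an illustration: the three
versions, their scores, and their training times are invented for this
sketch and are not Goldbeck's data. It shows how ranking the same three
versions by the two criteria can reverse the order.

```python
# Hypothetical data for three program versions:
# (mean posttest score, mean minutes of training).
# These figures are invented for illustration only.
versions = {
    "A": (42.0, 60.0),
    "B": (38.0, 40.0),
    "C": (30.0, 25.0),
}


def rank_by(criterion):
    """Return version names ordered from best to worst on a criterion."""
    return sorted(
        versions,
        key=lambda name: criterion(*versions[name]),
        reverse=True,
    )


# Criterion 1: amount learned only (time is ignored, so the criterion
# is deficient if efficiency matters to the programmer).
by_amount = rank_by(lambda score, minutes: score)

# Criterion 2: amount learned per minute of training.
by_efficiency = rank_by(lambda score, minutes: score / minutes)

print("Ranked by amount learned:     ", by_amount)
print("Ranked by amount per minute:  ", by_efficiency)
```

With these assumed figures the version that teaches the most is also the
slowest, so a programmer who looked only at amount learned would reach the
opposite conclusion from one who also counted time.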

Nachman and Opochinsky (ref. 102) found that the variable of class
size made a difference in amount learned in class (classes containing
fewer students learned more), but that the students in the larger classes
apparently studied more on their own time to compensate for this difference.
If one were merely interested in amount learned, one might conclude that
students would wind up with the same amount of knowledge whether they were
in small or large classes. If, however, one is also interested in time
taken by students to learn, both within and outside of the formal
classroom situation, one would conclude that the smaller class size led to
reduced learning time and, hence, greater learning efficiency. In many
educational, industrial, and military training situations, in which trainees'
time for independent study outside the formal training situation is limited,
it may be important to know how much time needs to be devoted to study in
conjunction with an auto-instructional program. Failure to consider this
would result in criterion deficiency.

Another type of criterion bias is criterion contamination, which occurs
when irrelevant considerations enter into the measurement of the criterion
behavior. The reader may have recognized an opportunity for criterion
contamination to occur in an earlier example: when salesmen are assigned
to territories which differ in "sales potential," then their volume of
sales, as a criterion, will reflect both their selling ability and the
sales potential of their assigned territory. At best, if salesmen are
randomly assigned to territories, the criterion measure will merely be
imprecise; if salesmen are assigned to territories in some systematic way
(e.g., on the basis of test scores or performance during training), the
criterion measure will be biased.

A classical example of criterion contamination in testing occurs when
the criterion measure is a rating, and the person doing the rating knows
the examinee's score on the predictor test. A supervisor, for example,
may know that a particular subordinate obtained a high score on a selection
test, and this knowledge might influence the supervisor (consciously or
not) to rate the subordinate higher than his on-the-job behavior would
otherwise merit. In such a case the test would appear to be spuriously
better than it actually was.



A study by Hughes (ref. 73) illustrates a possible opportunity for
criterion contamination to occur in programming. He used both
program-trained and "traditionally" trained groups of students, and evaluated
the effectiveness of each type of training by means of an essay-type
posttest. If the judges who marked the essays were aware of the group
from which the writer of each essay came (as seems likely), then
criterion contamination may have been introduced. Hughes is not explicit
on this point.

Another programming situation in which criterion contamination may
occur arises when the programmer is interested both in a measure of
learning just after training is completed and in a measure of how well
the learning is retained. One design he might use to compare programmed
and traditional instruction groups on both learning and retention is
shown below as Design (1).

Design (1)

Programmed instruction      Posttest      Retention test

Traditional instruction     Posttest      Retention test

The programmer would randomly assign learners to either programmed or
traditional instruction and test them after instruction and again some
time later. If he used this design, the collection of posttest data
might produce contamination of retention test criterion data. This
contamination might come about if the posttest sensitized the learners to
the test items which they would again be exposed to on the retention test.
As a result of such sensitization, they might discuss and think about the
test items during the time between the posttest and the retention test.
This extra experience with the items could then be reflected in retention
test performance. If the programmer did not intend to give the posttest
and retention tests in operational use of the program, the trainees would
not get this extra experience during operational conditions, and the
evaluation of the program would have presented too favorable a picture
of it.

One way for the programmer to avoid criterion contamination in this
situation would be for him to use Design (2):

Design (2)

A   Programmed instruction      Posttest

B   Programmed instruction                    Retention test

C   Traditional instruction     Posttest

D   Traditional instruction                   Retention test

Time →

Design (2) differs from Design (1) in that in Design (2), after
training, each group is randomly subdivided into two subgroups; one of these
subgroups receives a posttest, and the other subgroup receives a retention
test. The programmer would compare Group A with Group C in order to
determine the relative merits of programmed and traditional instruction
for learning, and Group B with Group D in order to determine their
relative merits for retention.
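
The random subdivision described for Design (2) can be sketched as a small
routine. This is a minimal illustration, not a procedure taken from the
studies cited: the function name, the fixed seed, and the use of equal-sized
halves are all assumptions made for the sketch.

```python
import random


def assign_design_2(learners, seed=0):
    """Randomly split learners into Design (2) groups A, B, C, and D.

    Learners are first randomly divided between programmed and
    traditional instruction; each instruction group is then randomly
    subdivided into a posttest subgroup and a retention-test subgroup.
    The seed is fixed only so that the illustration is repeatable.
    """
    rng = random.Random(seed)
    pool = list(learners)
    rng.shuffle(pool)
    half = len(pool) // 2
    by_method = {
        "programmed": pool[:half],
        "traditional": pool[half:],
    }
    # (posttest group label, retention-test group label) per method
    labels = {"programmed": ("A", "B"), "traditional": ("C", "D")}

    groups = {}
    for method, members in by_method.items():
        rng.shuffle(members)
        mid = len(members) // 2
        posttest_label, retention_label = labels[method]
        groups[posttest_label] = members[:mid]
        groups[retention_label] = members[mid:]
    return groups


# Hypothetical roster of 40 learners, identified by number.
groups = assign_design_2(range(40))
for label in sorted(groups):
    print(label, len(groups[label]), "learners")
```

Because no learner appears in both a posttest subgroup and a retention
subgroup, no one taking the retention test has previously seen the
posttest items.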

Another way for the programmer to avoid criterion contamination
in this situation would be to use an alternate form of the posttest as
a retention test. The study of Gagné and Dick (ref. 52) illustrates
the use of alternate forms of a test in conjunction with a program.
For further discussion of "parallel tests" and "equivalent tests" see
Gulliksen (ref. 65) and Thorndike (ref. 132); for a discussion of
"randomly parallel" tests, see Lord (ref. 93).

In addition to criterion deficiency and criterion contamination,
a third type of bias is criterion scale unit bias. Suppose that one
were interested in using the sales records made by salesmen as a
criterion for a test designed to select good salesmen. If one merely
counted how many sales each salesman made, these criterion data might
not be too meaningful. This is because each sale would not be of equal
value to the company employing the salesmen. On the other hand, the
total volume of sales made by each salesman or, better still, the total
profit to the company in the sales made by each salesman would provide
more meaningful criterion data. Such data would be more meaningful
because the company is not ultimately interested in the number of sales
made by each salesman, but rather in the profit each salesman produces.
It does not care whether Salesman A made more sales this year than he
did last year, but rather whether his sales resulted in more profit this
year than last. It does not care whether Salesman B made more sales
than Salesman C this year, but rather whether Salesman B's sales resulted
in more profit than did Salesman C's sales. Any one sale is not
necessarily as valuable to the company as any other sale, so the
use of the number of sales made by the salesman as a criterion measure
would result in what the test constructor would call scale unit bias.
Since any dollar of profit is as valuable to the company as any
other dollar of profit, the use of the amount of profit produced by
each salesman's sales would provide a criterion in which each unit
(dollar of profit) produced by a salesman was just as important to the
company as any other unit produced either by the same or by a different
salesman (see Brogden and Taylor, ref. 11).
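
The contrast between counting sales and summing profit can be made concrete
with a short sketch. The salesmen and the profit on each sale below are
invented for illustration; the only claim is that the two criterion
measures can order the same men differently.

```python
# Hypothetical profit (in dollars) on each sale made during the year.
# These figures are invented for illustration only.
profit_per_sale = {
    "Salesman B": [120.0, 80.0, 45.0, 30.0, 25.0],
    "Salesman C": [900.0, 650.0],
}

# Criterion with unequal units: each sale counts as 1, whatever its value.
number_of_sales = {name: len(sales) for name, sales in profit_per_sale.items()}

# Criterion with equal units: each dollar of profit counts the same.
total_profit = {name: sum(sales) for name, sales in profit_per_sale.items()}

best_by_count = max(number_of_sales, key=number_of_sales.get)
best_by_profit = max(total_profit, key=total_profit.get)

print("Most sales: ", best_by_count, number_of_sales[best_by_count])
print("Most profit:", best_by_profit, total_profit[best_by_profit])
```

Under these assumed figures Salesman B leads on the count of sales while
Salesman C leads on profit, which is the kind of disagreement that scale
unit bias produces.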

In many situations the programmer may also find the dollar criterion
useful in providing him with an equal unit criterion scale. If,
for example, he wanted to develop a program which would train people to
be good salesmen, he would use total profit in the sales made by each
trainee (not number of sales made) as an equal unit criterion measure.
In some situations, however, the programmer (and the test constructor)
may not find it easy to use a meaningful equal unit scale such as money.*
Consider a case in which his only available proximate criterion is the
number of items right on a 50-item achievement test. The programmer
could consider this measure to be on a meaningful equal unit scale if
it were equally important to him for Trainee A to get question 1 right
as for Trainee B to get it right, and for Trainee A to get question 1
right as for Trainee A to get question 2 right, etc. In one programming
study, Jones (ref. 80) apparently felt that a gain from pre- to posttest
of 10% of the group on one item was equally important to him as a gain
from pre- to posttest of 10% of the group on another item. In many
cases, however, the programmer may have considerable difficulty in
deciding whether gains on different items or by different learners are
equally important to him. One meaningful basis he might use for such
decisions is how costly it is to bring about gains on different items
or for different people by the best available alternative training
method. This information, unfortunately, may seldom be available.

* Some of the complexities of this problem of units in learning
situations are discussed by DuBois (ref. 30).

We have seen that in choosing a criterion measure for either a test
or a program there are several types of bias to be avoided. When one is
confident he has avoided these, and has a measure or measures which he
is interested in, he is ready to consider the reliability of his measures.
The reliability of a test measure refers to the consistency with which
it yields results. This consistency may be over time, as when the test
on different occasions yields similar results; or alternate-form
consistency, as when different versions of a test yield similar results;
or internal consistency, as when the component items of a test yield
similar results.*

If our criterion measure is a rating, as when a supervisor judges
the quality of a worker's performance, we would want the rating to be
the same whether it is made at 8 a.m. or at 4 p.m., and perhaps whether
it is made this month or next month. This would be consistency over
time, or "test-retest" reliability. We would also want the rating to be
the same even if the worker had had a different supervisor to rate him.
Similarly, if a total score on a test is a criterion measure, neither
would we want the total score to vary greatly depending on when it is
given, nor, if alternate forms of the test were available, would we want
the total score to vary greatly depending on what form of the test is
taken. For a discussion of factors influencing the reliability of a
test, see Thorndike (ref. 131); for a discussion of the reliability of
performance tests of an essentially nonverbal nature, see Ryans and
Frederiksen (ref. 116).
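
Consistency over time is commonly summarized by correlating the two sets of
measures. The sketch below computes a product-moment correlation between
morning and afternoon ratings of the same workers. The ratings are invented,
and the choice of the product-moment coefficient as the index is an
assumption of this sketch, since the text above does not name a particular
statistic.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical supervisor ratings of the same five workers,
# made at 8 a.m. and again at 4 p.m. (invented for illustration).
ratings_8am = [3, 5, 2, 4, 4]
ratings_4pm = [3, 4, 2, 5, 4]

test_retest_r = pearson_r(ratings_8am, ratings_4pm)
print("Test-retest correlation: %.2f" % test_retest_r)
```

The same function could be applied to ratings by two different supervisors,
or to scores on two alternate forms of a test, to index the other kinds of
consistency mentioned above.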

We have seen that the test constructor (and the programmer) must
concern himself with questions of relevance, possible bias, and
reliability of criteria. In the happy event that he finds himself with more
than one relevant, bias-free, and reliable criterion, how might he
proceed?

In some cases criteria of the same general nature may correlate
rather highly. In these cases one may choose to work with one of several
possible measures on the basis of convenience or economy. French
(ref. 47), for example, found that average freshman grades in college
were highly correlated with average four-year grades in college. This
meant that average freshman grades, which become available three years
before average four-year grades, could serve as criteria for new tests
of scholastic aptitude.

* This is a rather simplified view of reliability. For further
discussion, see Gulliksen (ref. 65) and Thorndike (ref. 132).

In the above example, average freshman grades and average four-year
grades were of the same general nature, that is, in each case the measure
was based upon grades assigned by instructors. When, however, criteria
are not of the same general nature, they may not correlate highly at all.
Consider the dissimilar criteria of speed of performance and quality of
performance. The fastest workers may not, of course, do the best work.
How, then, can the test constructor, who attempts to predict performance,
and the programmer, who attempts to produce performance, deal with both
speed and quality of performance?

Brogden and Taylor (ref. 11, p. 141) suggest that the cost accounting
principle may be useful in combining dissimilar sub-criterion measures
into a single criterion measure. "A tracing out of the exact nature and
importance of the effect of each sub-criterion variable on the efficiency
of the organization is the essential step which differentiates the dollar
criterion from the more conventional techniques." Once again the dollar
is suggested as a meaningful unit. Not only may it provide the programmer
with an equal unit scale, but it may also permit him to compare and
combine measures (such as time and quality of performance) which do not
appear to otherwise be comparable.

How might the dollar criterion be used? In an industrial setting
employee time can be given a dollar value in terms of wages, benefits,
equipment costs, overhead, etc. If a product is being produced, it too
can be given a dollar value in terms of the margin of profit in its sale.
Then, when a faster worker also is a less accurate worker, that is, when
he turns out more units in a given time but they are of inferior quality
to those of other workers, we can combine these two different aspects of
his performance into a single measure of cost which will be directly
comparable with the single measures of cost of other workers. For a
discussion of this procedure with numerical examples, see Brogden and Taylor
(ref. 11).
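
The combination of speed and quality into a single dollar figure can be
sketched as follows. This is not Brogden and Taylor's own computation, which
should be consulted in the reference cited; the formula, the function name,
and every dollar figure here are assumptions chosen only to show how two
dissimilar measures can be placed on one scale.

```python
def net_dollars_per_hour(units_per_hour, defect_rate, margin_per_good_unit,
                         cost_per_defective_unit, hourly_cost):
    """Combine speed and quality into one hypothetical dollar measure.

    Good units earn their profit margin, defective units incur a loss,
    and the worker's time (wages, overhead, etc.) is charged per hour.
    """
    good_units = units_per_hour * (1.0 - defect_rate)
    defective_units = units_per_hour * defect_rate
    return (good_units * margin_per_good_unit
            - defective_units * cost_per_defective_unit
            - hourly_cost)


# Hypothetical workers (all figures invented for illustration).
# The faster worker turns out more units but a quarter are defective.
fast_worker = net_dollars_per_hour(12, 0.25, 5.0, 4.0, 20.0)
careful_worker = net_dollars_per_hour(10, 0.0, 5.0, 4.0, 20.0)

print("Fast worker:    %.2f dollars per hour" % fast_worker)
print("Careful worker: %.2f dollars per hour" % careful_worker)
```

Under these assumed values the slower but more accurate worker contributes
more per hour; different margins or scrap costs could reverse the result,
which is why the values assigned to outcomes need to be made explicit.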

As we saw in the discussion of obtaining meaningful equal unit
criterion scales, sometimes the programmer will find it harder to apply the
dollar criterion than other times. The dollar criterion will not be so
easily applicable within the industrial setting when what the worker does
cannot be directly related to a tangible product, or again, in educational
and military settings. As Cronbach and Gleser put it: "The assignment
of values to outcomes is the Achilles heel of decision theory" (ref. 23,
p. 109). They point out, however, that any procedure for evaluating
outcomes and making decisions involves this assignment of values, and so
it may well be desirable to make this assignment explicit to one's self
and to others.
Determining the Resources Available



The test constructor who has spelled out his objectives and has
developed criterion measures is in a position to assess the resources
available to him for test construction. If he is developing a selection
test, he will first want to consider how well he can presently predict
the behavior of interest to him, and then consider how this prediction
can be improved upon. He may already know, for example, that good
salesmen have above average verbal ability. He may wonder what personality
traits can be used to characterize successful and unsuccessful salesmen.
He may proceed to first-hand observations of salesmen's behavior, he may
talk to salesmen and their supervisors to find out what they feel leads
to success and failure in selling, and he may look at existing records
or collect new information on the characteristics of successful and
unsuccessful salesmen. For a more detailed discussion of job analysis
procedures, see Thorndike (ref. 131, pp. 12-31).

The programmer faces a somewhat different problem at this stage.
Since he wants to modify behavior, not predict it, at this stage he wants
to assess how close his trainees now are to having the terminal (criterion)
behavior. Carr (ref. 15, pp. 557-558) has said,

     "The programmer must also specify precisely the initial
     S-R connections, i.e., those connections already in the
     learner's repertory which approximate the terminal S-R
     connections and from which the transitional S-R
     connections are to be developed. . . . To the writer's
     knowledge, no research has been done on the problem of
     specifying the initial S-R connections on which the
     program is to be built."

The problem is this: of the large number of S-R connections which
the learners possess at the beginning of training, which ones are
relevant to the programmer, that is, which ones need to be built upon to
produce the criterion behavior? For example, in the programming of
automobile driving, the learners' initial knowledge of French and of
architecture may be obviously irrelevant, and their knowledge of traffic laws
obviously relevant. But what about their knowledge of how the car's
engine works, of how to use the meters on the dashboard, and of their
verbal knowledge of the relationship between driving speed and stopping
distance? In the latter cases it may not be so obvious whether these
behaviors are relevant to the criterion, and, if so, what other behaviors
are to be built upon them. One approach to this problem will be discussed
in Section , 

Both the test constructor and the programmer face certain limitations
in their respective efforts to predict and to modify behavior. These
limitations can be grouped into (a) limitations during test or program
development, and (b) limitations during operational use of the test or
program.

(a) Limitations During Development 

The limitations the test constructor or programmer will usually
encounter during development are limitations of time, personnel, and
equipment. When a test or program is needed in a hurry, or when the
needed personnel or equipment are not available, the test constructor
(or programmer) may be forced to do an inadequate job of such things
as specifying objectives, pretesting, revision, and the collection of
criterion data. As we have already seen, the specification of objectives
is an extremely important aspect of test and program construction, and if
the test constructor or programmer does an inadequate job here he cannot
usually correct this on the basis of later information.

When pretesting and revision are curtailed, the implications for
testing may not be as severe as the implications for programming. In
testing, if the population of test-takers is large enough and the time
available for test-taking long enough, the test constructor can administer
more items or tests than will eventually prove useful. He can then
ascertain from a sample of the examinees' test papers which are the useful
items and tests; from another sample he can check on the usefulness of
these items and tests (cross-validation); and then for the remainder of
the examinee population he need score only those items and tests which
have been found to be useful. In programming, however, the items (frames)
are not conceived to be independent, but rather are cumulative; that is,
in programming the effects of exposure to individual frames on criterion
behavior cannot usually be isolated. Since frame revision on the basis of
tryout with students is considered to be a vital part of programming
(refs. 69, 86), we may expect that limitations in resources which curtail
tryout and revision will seriously affect the usefulness of programmed
instruction.
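
The select-then-check procedure described for testing can be sketched in a
few lines. Everything specific here is assumed for illustration: the tiny
samples, the item responses, the criterion scores, and the rule of keeping
items whose correlation with the criterion is at least 0.5. The sketch
selects items on one sample and then checks the resulting total score
against the criterion on a second, independent sample.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_items(item_columns, criterion, threshold):
    """Indices of items whose correlation with the criterion meets threshold."""
    return [i for i, column in enumerate(item_columns)
            if pearson_r(column, criterion) >= threshold]


def total_scores(item_columns, kept):
    """Each examinee's total over the kept items only."""
    n_examinees = len(item_columns[0])
    return [sum(item_columns[i][p] for i in kept) for p in range(n_examinees)]


# Hypothetical data: one list per item (1 = right, 0 = wrong),
# one entry per examinee; criterion scores for the same examinees.
derivation_items = [[0, 0, 0, 1, 1, 1], [1, 0, 1, 0, 1, 0], [0, 0, 1, 0, 1, 1]]
derivation_criterion = [1, 2, 3, 4, 5, 6]
crossval_items = [[0, 0, 1, 0, 1, 1], [1, 1, 0, 0, 1, 0], [0, 1, 0, 1, 0, 1]]
crossval_criterion = [1, 2, 3, 4, 5, 6]

# Step 1: pick the useful items on the first sample.
kept = select_items(derivation_items, derivation_criterion, threshold=0.5)

# Step 2: check those items on the second sample (cross-validation).
crossval_r = pearson_r(total_scores(crossval_items, kept), crossval_criterion)

print("Items kept:", kept)
print("Cross-validated correlation: %.2f" % crossval_r)
```

The sketch treats each item's contribution as separable from the others,
which is exactly the assumption that the text says does not hold for
cumulative frames in a program.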

(b) Limitations During Operational Use 


Limitations on time available for testing during operational use are
also somewhat different from the limitations on time available for students
to go through a program. In testing, we conceive of the measurement as
taking place at one instant in time, although the actual test administration
may last several hours. In programming, we conceive of the instruction as
taking place over a finite period of time. When testing time is limited,
the test will consist of fewer items, and we may expect the reliability of
the test to suffer. When learning time is limited, it is not clear whether
the program should consist of fewer frames. Evans, Glaser, and Homme
(ref. 39) reported that when more frames were added to a program, the amount
of time taken per frame decreased. This presumably reflects the fact that
when more frames were used, the "steps" between the frames were made smaller
and therefore could be taken more easily and quickly.* Holland (ref. 69)
also reports that when a program was revised and lengthened on the basis of
student responses during tryout, total time to go through the program was
reduced.
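
The remark above that a shortened test may be expected to lose reliability
can be given a rough numerical form with the Spearman-Brown formula, a
standard result of test theory that the text does not itself name. The
formula assumes that the items removed or added are comparable to those
retained; the starting reliability of 0.80 is a hypothetical value.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Spearman-Brown formula: r_k = k * r / (1 + (k - 1) * r),
    where r is the current reliability and k the length factor.
    Assumes the added or dropped items are comparable to the rest.
    """
    k = length_factor
    return (k * reliability) / (1.0 + (k - 1.0) * reliability)


# Hypothetical test with reliability 0.80.
halved = spearman_brown(0.80, 0.5)   # only half the items can be given
doubled = spearman_brown(0.80, 2.0)  # twice as many items can be given

print("Half-length test:   %.3f" % halved)
print("Double-length test: %.3f" % doubled)
```

No comparable formula is offered here for programs, since, as the text
notes, it is not clear that limited learning time calls for fewer frames.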

If training time is limited so that fewer frames are used and larger
steps must be taken, some learners may not be able to take these steps, and
after some point in the program they will be unable to benefit from the later
frames. If the revision suggests that additional frames are needed, but time
limitations prevent their being used, a major advantage of programmed
instruction may be lost.



* For a thorough discussion of the concept of size of step, see
Lumsdaine (ref. 94).

A second possible limitation during operational use of tests and
programs is a limitation on supervisory personnel. In testing, when a
test is intended to be given by many relatively untrained proctors, the
test constructor must make the administration a simple process. In such
an instance he might not have separately timed sections of the test, for
example. In programming, if no proctor is around and programmed texts
are used, it may be possible for the learners to "cheat" by looking ahead
at the answers before deciding upon their own answers. We do not know
whether this impedes learning; research is needed on this question. The
use of machines may prevent cheating but may introduce the need for
personnel to maintain the machines. The programmer should anticipate the need
for some supervisory personnel with "auto"-instruction.

A third possible limitation during operational use of tests and
programs is a limitation in scoring facilities. Tests may be scored by
machine, by the examinee himself, or by another person. Traxler (ref. 135)
gives detailed considerations in the decision to have machine or human
scoring. In programming, the learner's answer is compared with the
"correct" answer not for the purpose of scoring but for the purpose of
providing him with knowledge of results, that is, for the purpose of telling
him whether he was right or not.* This comparison can be made by the device
(from programmed textbook to computer-controlled instructional system) if
the format of the frame is multiple choice, that is, if the learner is to
choose from a specified set of alternatives. If, however, the format of
the frame calls for a constructed response which the learner is to compose
himself, a problem arises in the comparison of his answer with the correct
answer. The nature of this problem will be discussed in a later section
on item and frame writing.

Planning and Developing Items (Frames) 

When the test constructor has sufficiently specified his objectives 
and noted what resources are available to him, he is ready to prepare a 
preliminary version of the test. One of the first considerations will be
that of the scope, or extent of coverage, of the test. If the test is
intended to predict success in college as measured by grade-point average,
should the test include items intended to get at certain personality
characteristics (e.g., perseverance, skill in interpersonal relations) as
well as at certain cognitive characteristics (e.g., verbal ability,
mathematical ability)? The test constructor will attempt to cover those
areas which
seem important and which his resources allow him to cover. 

When a test is intended to measure achievement in an academic area,
a procedure is sometimes followed from which we can derive a precaution
for programming. This procedure is based upon a distinction made between
"subject matter" on the one hand and "ability" or "process" on the other.
Ferris (ref. 41), for example, in working out specifications for a new
physics achievement test with subject matter experts, considered such
things as time, mass, geometric optics, and conservation of energy as
subject matter topics, and the ability to demonstrate qualitative
understanding of fundamentals, to apply knowledge to unfamiliar situations,
and to draw valid conclusions from observation and data as abilities.
The subject matters and abilities are laid out in a two-way grid or
matrix (see Vaughn, ref. 136). The procedure involves specifying the
number of achievement test items which are to be developed for each of
the intersections of subject matter and ability categories (e.g.,
geometric optics and ability to apply knowledge to unfamiliar situations).
These numbers should reflect the test constructor's relative interest
in the various subject matter and ability categories. Such a
specification may be useful to the test constructor because it prods him into
writing items to cover that in which he is interested, rather than merely
writing items for those categories in which item-writing is easy. The
specification may also be useful to other people who may wish to use the
test, since it communicates the coverage of the test to them.

* An additional purpose may be to provide information which can be
used to determine what material the learner is to be presented with next.
This will be considered further in a later section.

An important point to remember is that although the test constructor
denotes his relative interest in each category by the number of items he
assigns to it, the number of items gives no more than a rough indication
of the actual contribution of each category to total test score. This
contribution will depend not only on the number of items in the category,
but also on the standard deviation of the subscores from the category and
on the correlation of these subscores with the subscores from the other
categories in the test.
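
The point can be illustrated with a small numerical sketch. One common
way to express a category's contribution, assumed here for illustration
and not taken from the sources cited, is the covariance of the category
subscore with the total score divided by the variance of the total
score; these shares sum to one. The category names and the scores below
are hypothetical.

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance of two equally long lists of scores."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def variance_shares(subscores):
    """Share of total-score variance attributable to each category.

    subscores maps a category name to a list of examinee subscores,
    with examinees in the same order in every list.
    """
    names = list(subscores)
    n_examinees = len(subscores[names[0]])
    totals = [sum(subscores[name][i] for name in names)
              for i in range(n_examinees)]
    total_variance = covariance(totals, totals)
    return {name: covariance(subscores[name], totals) / total_variance
            for name in names}


# Hypothetical data: "optics" has twice as many items (10) as
# "mechanics" (5), but its subscores vary much less across examinees.
example = {
    "optics": [8, 9, 8, 9, 8, 9],
    "mechanics": [1, 5, 2, 4, 3, 5],
}
shares = variance_shares(example)
for category, share in shares.items():
    print(f"{category}: {share:.2f}")
```

In this made-up case the category with fewer items accounts for most of
the variance in total score, because its subscores have the larger
standard deviation.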

How is this related to programming? Evans, Glaser, and Homme (ref. 40)
have suggested that in setting up specifications for a program, one should
also make use of a matrix, a "Ruleg" matrix. Presumably this matrix would
also serve the functions of insuring that the programmer cover that which
he intends to cover, and communicating to others what it covers, as well
as the function Evans et al. mention of getting the programmer to
interrelate the concepts in the program. An important point for the
programmer to remember is that the numbers of frames devoted to each rule
may only be roughly proportional to how well each rule is learned. It would
seem that different concepts will require different numbers of frames to be
thoroughly understood. We already know that in the relatively simple case
of learning paired-associates different pairs require different
amounts of practice (e.g., see ref. 85).

Item (Frame) Format Specifications 

The test constructor who has spelled out the specifications for his
test is ready to consider the item format or formats which he will use.
His choice among different item formats may already have been limited if
his assessment of resources indicated that the test must be scorable by
machine. The choice of formats he makes will probably reflect what, for
him, is an unhappy merging of both "practical" considerations, e.g., ease
of writing items, ease of scoring, and certain "theoretical" considerations,
e.g., "I don't think I can test writing ability with multiple-choice items."
While theoretical rationales might be developed for using or for not using
any format for any purpose, we do not know whether certain formats are
inherently more desirable than others in all situations, or even inherently
more desirable than others in a particular situation:

"...any such characteristic differences (in reliability
and validity) as may exist among item forms are of
trivial consequence when compared with the extreme
differences observed among items of the same form."
(Ref. 33, p. 109.)

In programming, it appears that characteristic differences among frame
formats may also be of trivial consequence when compared to the differences
observed among frames of the same format. The available evidence on
multiple-choice vs. constructed response frame formats (e.g., see refs. 37,
113) does not show outstanding differences between these two formats.*
Nor would it be clear what it meant if, say, each of these studies showed
the constructed response format to be clearly superior. We cannot specify
the relevant dimensions along which multiple-choice and completion formats
might differ, so we cannot sample these dimensions to obtain generalizable
results.

Writing Items and Frames

For the most part, the suggestions available in the literature for
writing test items are based upon informal, uncontrolled observations,
"folklore," "common sense" considerations, etc. In one study Dunn and
Goldstein (ref. 31) tried to systematically evaluate some of the
traditionally accepted rules for writing test items. The rules dealt with
"incomplete statement versus question lead, absence or presence of specific
determiners or cues to the correct alternative, alternatives of equal length
versus extra-long correct alternative, and consistency or inconsistency in
grammar between lead and alternatives." Their findings gave no support to
any of the four rules with which they worked.

Suggestions for writing auto-instructional frames which are derived
from informal, uncontrolled observations, "folklore," "common sense"
considerations, etc., are also available (e.g., see refs. 55, 86). These
suggestions, like the suggestions for writing test items, are of unknown
validity. Furthermore, just as the finding of Dunn and Goldstein had
rather negative implications for test item writing, there is a study which
has rather negative implications for frame writing. Newman (ref. 103)
compared a group of students whose study materials were sequenced and
controlled in accordance with principles derived from learning research with
a group of students who used their own study techniques, and found that
the group using their own study techniques learned more. We do not know,
of course, how far we can validly generalize this finding. But the finding
should serve to caution programmers against a rigid adherence to
insufficiently tested rules for the construction of program frames, just as
the study of Dunn and Goldstein should serve to caution test constructors in
a similar way. With this background of caution and skepticism, let us
look at some suggested rules of unknown validity for writing test items,
and see what implications they might have for writing program frames.

* Frederiksen (ref. 45) has worked with a new response mode in testing
which incorporates features of both multiple-choice and constructed response
formats. S constructs his answer, then views E's alternatives and "chooses
the one which best approximates his response." In this way any advantage
of S constructing his response is obtained, while the problem of scoring
constructed responses is minimized. Gilbert (in ref. 95, pp. 545-546) has
suggested that such a response mode be developed for programming.

Discussions of how to write test items are available in Ebel (ref. 33)
and Travers (ref. 134). According to Ebel, the most important suggestion
is to "express the item as clearly as possible." In this context "clearly"
means unambiguously and understandably. "Test items should not be verbal
puzzles. They should indicate whether the student can produce the answer,
not whether he can understand the question" (ref. 33, p. 213).

It would seem that in writing a program it is also important to strive
for clear, unambiguous frames. The programmer might use certain available
procedures which have been developed to assess the "readability" of his
frames. The Flesch count (ref. 44) measures readability by combining
measures of average sentence length and number of syllables per word, while
the Dale-Chall count (ref. 28) measures readability by combining measures
of average sentence length and relative frequency of words not on a list
of 3000 easy words. Both counts yield similar results, and the choice
between them may be made on the basis of convenience. Dale and Chall point
out some limitations in the use of this type of count.* The programmer
might use such a count to make his program more readable before he tries
it out on students. Research is needed to establish whether this is a
feasible way to improve programs. One group of students might be given a
first draft of a program, and another group of comparable students given
a draft of the program which has been revised on the basis of readability
count. The two groups would then be compared on time taken to go through
the program and on posttest achievement.
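
A readability count of this kind is easy to mechanize. The sketch below
assumes that the Flesch count referred to is Flesch's Reading Ease
formula, 206.835 - 1.015 (words per sentence) - 84.6 (syllables per
word), on which higher scores mean easier reading. The syllable counter
is a crude vowel-group heuristic introduced only for illustration; it is
not part of Flesch's procedure and will miscount some words.

```python
import re


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease score computed from raw counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def count_syllables(word):
    """Rough estimate: one syllable per run of vowels (an assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text):
    """Return (words, sentences, syllables) for a piece of frame text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


# Hypothetical first-draft and revised wording of the same frame.
draft = ("Reinforcement contingencies systematically determine the "
         "probability of subsequent responding.")
revised = "A reward makes a response more likely. We call this reinforcement."

for label, text in (("draft", draft), ("revised", revised)):
    score = flesch_reading_ease(*text_counts(text))
    print(f"{label}: {score:.1f}")
```

Such a score could supply the "readability count" on which the revised
draft in the proposed comparison is based; whether revision of this kind
actually improves a program remains the open research question.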

One difference between test items and program frames that we should
keep in mind when we try to apply item writing suggestions to frame writing
is that in general each test item is self-contained, that is, it must be
understood by the examinee when it occurs alone, while each program frame
occurs in the context of other frames, and these other frames may serve to
clarify its meaning. Consider this frame: "Some errors possible in
attempting a(n) ______ response are errors in Content, Language, Depth, and
Meaning" (from Ellison et al., in ref. 111, p. 99).

If this were a test item, it would not be too clear what is called
for. In the context of the program in which it actually occurs, however,
the preceding frames serve to clarify its meaning. This example suggests
that some of the more specific suggestions for writing tests (for which
the empirical basis is not too secure) may not be directly applicable to
writing frames.

* An additional limitation for the programmer is that a readability
measure may be inappropriate when technical vocabulary is to be taught.

Let us look at some of these suggestions.

1. "Avoid including two or more ideas in one statement" (ref. 134, p. 54).

In testing, when an examinee cannot answer an item which deals with
several ideas, we have no way of knowing which of the ideas he cannot
handle. From this point of view it may be undesirable to use such an item.
In one programming study, however, Severin (ref. 119) found that the use
of "two pairs" frames, which contained two Russian-English vocabulary word
pairs per frame, resulted in more learning than did the use of a
two-alternative multiple-choice frame, which contained only one
Russian-English vocabulary word pair per frame. In this specific instance,
therefore, the suggestion for test item writing does not seem to hold for
program frame writing.

2. "Avoid the inclusion of nonfunctional words in the item" (ref. 33, p. 215).

Ebel considers a word in a test item to be nonfunctional "when it does
not contribute to the basis for choice of a response." Holland makes a
similar suggestion for program writing: "It is probably an adequate rule of
thumb to say that any portion of an item which is not necessary for the
student to arrive at a correct answer cannot safely be assumed to be taught
by the item" (ref. 70).

Some suggestions are specific to constructed-response items: 

3. "Direct questions are probably preferable to incomplete declarative
sentences, especially for younger, less 'test-sophisticated' pupils, because
the former are more similar to the forms in which ordinary discourse is
carried on.

Faulty: America was discovered in the year ______.

Improved: In what year was America discovered?" (ref. 110, p. 81).

This is one of the four points for which Dunn and Goldstein found no
empirical support. We will not consider its implications for writing frames.

4. "Keep the ratio of words given to words omitted very high because, if
too many words are omitted, the meaning of the whole will be obscure"
(ref. 134, p. 41).

This suggestion would seem to also apply to programming. Below are two
examples of frames written by programmers who aim at little or no learner
error. In each case the substantial proportion of errors made may be due
to the violation of the above suggestion.

"A child has a 'temper tantrum' screaming for candy. The mother gives
the child the candy, and the tantrum ceases. The mother's response of
handing the candy to the child is ______ by the ______ of the tantrum"
(ref. 68, p. 78). Fifty-six percent of all the learners got both answers
correct.

"LEARNING is indicated by any 'change' in ______ to a situation which
is the result of responses to the same or similar ______, a ______
not nullified to any degree by an extended ______ of ______ during which
neither that nor any similar situation is presented" (ref. 5, p. 190). The
percentages of all learners who correctly filled in these blanks were 96%,
43%, 86%, 57%, 86%, and 80%, respectively.

The ratio of words given to words omitted may be a rather coarse index
of how obscure a frame is. "...one nation, under God, ... with ..."
may be an easier-to-complete frame than the frame from Barlow quoted above,
yet such an index would rate it as more obscure. Perhaps rather than trying
to formulate a rule regarding the ratio of words given to words omitted, we
should only conclude that when a frame gives students trouble, the programmer
should consider the possibility that too many important words are omitted.
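
For a programmer who nevertheless wishes to screen frames with this
coarse index, the ratio is simple to compute. The sketch below assumes,
purely for illustration, that each blank in a frame is typed as a run of
two or more underscores; the sample frames are hypothetical.

```python
import re


def given_to_omitted_ratio(frame, blank_pattern=r"_{2,}"):
    """Ratio of words given to words omitted in a completion frame.

    A token counts as omitted when it contains a blank, here assumed
    to be written as two or more underscores.
    """
    tokens = frame.split()
    omitted = sum(1 for token in tokens if re.search(blank_pattern, token))
    given = len(tokens) - omitted
    if omitted == 0:
        raise ValueError("frame contains no blanks")
    return given / omitted


# Hypothetical frames: one blank versus three blanks in similar text.
sparse = "To reward an organism with food is to ____ it with food."
dense = "To ____ an organism with ____ is to ____ it with food."

print(given_to_omitted_ratio(sparse))
print(given_to_omitted_ratio(dense))
```

A low ratio would only flag a frame for inspection; as noted above, a
familiar passage with many blanks may still be easy to complete.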

5. "The blanks should refer only to omitted key words" (ref. 134, p. 42).

Holland (ref. 70) compared a group given a program with key words omitted
in each frame with a group given a program with "trivial" words omitted in
each frame. Sample frames from each of these versions of the program were:

A technical term for "reward" is reinforcement. To "reward" an organism
with food is to ______ it with food. (key word omitted)

A technical term for "reward" is reinforcement. To "reward" an organism
with food is to reinforce it with ______. (trivial word omitted)

The group with the key words omitted did better on the posttest. Similarly,
Jones (ref. 80), working with a multiple-choice format, concludes that the
correct answer should not be "trivial." "The good item may be characterized
as... one which cannot be answered by reasoning or knowledge of vocabulary
alone" (ref. 80, p. 99).

6. "Specify the terms in which the response is to be given.

Faulty: Where is the world's tallest building located?

Improved: In what city is the world's tallest building located?" (ref. 110).

The reasoning behind this suggestion is that it is hard to anticipate
all possible answers to a completion item (e.g., "North America" might be
an answer to the faulty version of the above item), and so it is useful for
the test constructor to state the form the answer is to take (in the above
item, the improved version specifies that the name of a city is wanted). With
mathematical subject matter it may also be necessary to state the degree of
precision wanted in the answer, e.g., the number of significant figures.

In following this suggestion, however, the test constructor is not to
choose just any method of specifying the terms in which the response is to
be given: "Hints concerning the correct answer, in the form of the first
letter of a word, or a number indicating the number of letters in a word,
should generally not be employed. Such hints may tend to confuse pupils when
the answer upon which they have decided, although it is a correct synonym,
does not coincide with the given hint. Guessing and responses to superficial
cues may also result from this practice" (ref. 110, p. 82).

In programming, it may be particularly important to specify the terms
in which the response is to be given. As we have seen in the discussion of
limitations during operational use of the program, the problem of how to
tell the learner that he is correct may arise when a constructed response
format is used. Programmers have generally left it to the student to compare
his response with the "correct" response. This leaves it up to the student
to recognize that his response, which may be stated in different terms than
the "correct" response, is essentially equivalent to it. This extra burden
on the student may be relieved by specifying the terms in which the response
is to be made, e.g.,

If A = 1, 2, 3 and B = 4, 5, then A is not equal to ______.
(Use one letter for your answer)

(from ref. 35, p. 15) or by using interchangeable synonyms when providing
knowledge of results, e.g., "Latency is the ______ between the onset of an
energy change and the onset of a response which it elicits." Time (interval,
period) (from ref. 71, p. 4).

Some suggestions are specific to multiple-choice items.

7. If you want to increase (decrease) the difficulty of an item, make the
distractors more homogeneous (heterogeneous). Remmers and Gage (ref. 110)
give this example: "Which city is nearest to Chicago? (1) Los Angeles,
(2) New York, (3) St. Louis, (4) Miami (less homogeneous); (1) Minneapolis,
(2) St. Louis, (3) Cleveland, (4) Milwaukee (more homogeneous)." The
programmer who is interested in a gradual "shaping" of behavior within a
multiple-choice format might progressively increase the homogeneity of the
alternatives in a series of frames.

8. When it is difficult to anticipate what mistakes will be made in answering
an item, do not use "none of these" as the correct answer, since both people
who are correct and people who make unanticipated mistakes will choose it
(see ref. 33, p. 237). Consider the following frame:

[Frame from Evans, ref. 38: an algebraic expression in X and Y "means the
same as:" followed by three algebraic alternatives, (A), (B), and (C), and
a fourth alternative, (D) "none of these."]

The learner who makes any mistake other than (A), (B), or (C), e.g.,
$(XY)^{-1}$, as well as the learner who knows the correct answer,
$\frac{X \cdot X}{Y \cdot Y \cdot Y}$, will both choose (D). If mistakes
other than (A), (B), and (C) are at all common, this might be a poor frame.

9. "Make all distractors plausible and attractive to examinees who lack the
information or ability tested by the item"* (ref. 33, p. 244).

Pressey feels that in an instructional test (auto-instructional program),
the distractors might be more than just plausible and attractive.

"Each wrong answer should be one against which a warning
is needed, or which elucidates the question in some way.
No alternative answer should confuse the student or
introduce ways of construing the question which are not
educationally profitable to consider. Poor alternatives waste
time both in taking the test and in discussion after, and
might confuse the learner rather than help him"
(ref. 106, p. 422).

* Those who reject the multiple-choice format would find the use of this
test construction rule in programming to be particularly objectionable:
"...effective multiple-choice material must contain plausible wrong
responses, which are out of place in the delicate process of 'shaping'
behavior because they strengthen unwanted forms" (ref. 123, pp. 140-141).
As we saw earlier, we have no firm basis for favoring either multiple-choice
or completion formats in all situations, and it is not clear that we ever
will.

Pressey's statement suggests that the number of alternatives for a
multiple-choice frame should depend on the content of the particular frame,
and that all the frames in a program need not have the same number of
alternatives. On the other hand, some research in programming (refs. 81, 119)
has attempted to compare differing numbers of alternatives as an independent
variable. Since this "variable" may actually be a complex of variables (e.g.,
popularity of alternatives, similarity of alternatives), the results of these
studies should be cautiously interpreted.

The suggestions for test item writing given here, which do not exhaust
the supply of all possible suggestions, indicate that test construction is
a complex, highly skilled activity. This, in turn, might suggest that it be
carried out by a professional test constructor. Unless the subject matter
is very simple, however, the professional test constructor may emphasize
relatively trivial, easily testable aspects of it and neglect its basic
structure (ref. 136). Collaboration with a subject matter expert may help
to eliminate this danger.

Good programming is also thought to involve both subject matter mastery
and programming ability (ref. 111). Because of the relative newness of
programming, talent for it may be unavailable, or perhaps unknown to those
possessing it.

Rigney and Fry indicate that one skill the programmer must develop is
that of going slowly, of proceeding in small steps: "...(the beginning
programmer) is quite likely to write the first version of his program with
steps that are inappropriate, too difficult, and too few for the material"
(ref. 111, p. 14). Other programmers have expressed similar sentiments.

The taxonomy of educational objectives prepared by Bloom et al. (ref. 8)
may be helpful in this connection. The taxonomy is based upon six major
classes of objectives: Knowledge, Comprehension, Application, Analysis,
Synthesis, and Evaluation. As just given, they are assumed to be in
hierarchical order, that is, the objectives in one class are "likely to make
use of and be built on the behaviors found in the preceding classes in this
list" (ref. 8, p. 18).

Bloom et al. present sample items and invite the reader to classify them
as to objective, using their taxonomy. This type of task might be useful as
a test item in a test used to select programmers. Potential programmers who
consistently underestimate the level of objectives would presumably write
items that were too difficult. This type of task might also be useful in
training programmers.

Research is needed on the extent to which "experts" agree in classifying
items in this taxonomy, and, of course, on whether the classes of objectives
are actually hierarchically ordered. We will further discuss the question of
ordering behavioral skills in a later section.

Pretesting (Tryout) and Revision

When the test is constructed, the next step is to pretest it, or to
try it out. The test constructor should first try it out on his
colleagues, who may offer suggestions concerning format, editorial
considerations, ambiguities, and inaccuracies. The term "pretest," however,
usually refers to the trying out of the test material on members of the
population for which it is intended. Such a pretest may serve several
purposes. It may uncover weaknesses in instructions and format, and provide
information for establishing time limits, for establishing a desirable test
length, and for improving and selecting items.

In programming, several recent reports (refs. 19, 59, 122) indicate
that when the learner is not required to make any overt response in going
through the program, learning does not suffer and learning time may be
decreased. While we do not know whether this finding will hold up with
learners not highly motivated by taking part in an experiment, it does
suggest that under certain circumstances the learner's overt responses are
not necessary during operational use of the program. It would still seem to
be highly desirable for the learner to make overt responses during the tryout
of the program, however, so that they could serve as a basis for revising
the program.

We will now look at various aspects of pretesting and revision in test
construction. In each case we will see what implications may be derived
for programming.

Instructions and Format 

The test constructor may use a small number of people and perhaps a
typewritten draft of the test when he attempts to uncover weaknesses in its
instructions and format. Conrad (ref. 17) refers to this stage as a
"pretryout." During pretryout the instructions may prove to be incomplete,
ambiguous, or otherwise deficient.

A pretryout stage seems desirable in program development too. The
learner, who may or may not be familiar with some of the more commonly used
testing procedures, will almost invariably be unfamiliar with the programming
procedure (which will include the novel feature of knowledge of results, and
possibly other novel features, such as branching). The programmer's
instructions will aim at acquainting the learner with programming procedures,
but various misunderstandings on the part of the learner may occur and be
revealed by a pretryout.

In addition to weaknesses in instructions, a pretryout may also
uncover weaknesses in how the test was put together. For example, the
information needed to answer a question may be on the previous page in the
test booklet, or one particular response position may be correct much more
than its proportionate share of the time, etc. These weaknesses in putting
the test together will be called weaknesses in format.
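
The second of these weaknesses can be checked mechanically before any
examinee sees the test. The sketch below tallies how often each response
position is keyed as correct and flags any position whose share exceeds
its proportionate share by more than a chosen margin. The margin of 0.10
and the answer key are arbitrary values assumed for illustration.

```python
from collections import Counter


def position_shares(answer_key, positions):
    """Proportion of items for which each response position is correct."""
    counts = Counter(answer_key)
    n_items = len(answer_key)
    return {p: counts.get(p, 0) / n_items for p in positions}


def overused_positions(answer_key, positions, tolerance=0.10):
    """Positions keyed correct more often than their proportionate share.

    The proportionate share is 1 / (number of positions); tolerance is
    an arbitrary allowance chosen by the test constructor.
    """
    expected = 1 / len(positions)
    shares = position_shares(answer_key, positions)
    return [p for p in positions if shares[p] > expected + tolerance]


# Hypothetical answer key for a ten-item, four-alternative test.
key = list("ABACADAABA")
print(position_shares(key, "ABCD"))
print(overused_positions(key, "ABCD"))
```

The same tally could be run on the keyed alternatives of a
multiple-choice program before tryout.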

A test may also be considered weak in format when response sets are
allowed to operate. Response sets may be defined as tendencies of subjects
to respond in ways which defeat the purpose of the measurement. For example,
one response set, "acquiescence," is the tendency to agree with a statement
regardless of its content. The set to gamble is the tendency to guess
when the answer is not known. For a discussion of the operation of
response sets in personality assessment, see Jackson and Messick (ref. 76).

Since response sets make the interpretation of test scores ambiguous
because they measure things the test constructor is not primarily
interested in, their influence should be minimized. Response sets tend to
occur in situations which are somewhat unstructured and/or too difficult
for the examinee. Their influence may therefore be minimized by a
restructuring of the test.

Let us look at some response sets which might occur in programming
and see what might be done about them.

The learning of paired-associates is a fairly common task with which
a program might deal.* In such a task the student must learn to associate
particular responses with particular stimuli, e.g., state capitals with
names of states, telephone numbers with people, names of symbols with
symbols, etc.** If a program always presents the stimulus terms of
paired-associate items in the same order, the student may learn a chain of
response terms without paying attention to the stimulus terms. This would
permit a response set to operate which might lead to the premature
termination of training, since the student would appear to be learning the
paired associates as paired-associates. The programmer could prevent the
formation of this response set by scrambling the order in which the stimulus
terms are presented on successive occasions.
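
Scrambling of this kind is easily automated when frames are assembled
by machine. The following sketch reshuffles a set of paired-associate
items for each pass through the material and avoids repeating the
immediately preceding order. The function name, the fixed random seed,
and the sample pairs are illustrative assumptions.

```python
import random


def scrambled_trials(pairs, n_trials, seed=0):
    """Return n_trials presentation orders of the same stimulus-response pairs.

    Each order is an independent shuffle; an order identical to the one
    just before it is reshuffled so that no two successive passes match.
    """
    rng = random.Random(seed)  # fixed seed so a tryout can be reproduced
    trials = []
    for _ in range(n_trials):
        order = list(pairs)
        rng.shuffle(order)
        # Reshuffle only if a different order is possible at all.
        while trials and order == trials[-1] and len(set(order)) > 1:
            rng.shuffle(order)
        trials.append(order)
    return trials


# Hypothetical paired-associate items: state (stimulus), capital (response).
state_capitals = [
    ("Ohio", "Columbus"),
    ("Texas", "Austin"),
    ("Maine", "Augusta"),
    ("Utah", "Salt Lake City"),
]

for trial in scrambled_trials(state_capitals, n_trials=3):
    print([stimulus for stimulus, _ in trial])
```

Every pass contains the same pairs, so the amount of practice per pair
is unchanged; only the serial position of each stimulus varies.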

In this example, the tendency to learn the response terms as a chain
without regard to the stimulus terms may or may not ultimately make it
easier to learn the response terms as responses to their respective stimuli.
This is a question which might be answered by research on the learning
process. The point made here is that the response set in question may
interfere with the measurement of the student's proficiency, that is, how
well the student does when the stimulus terms remain in a constant order
may not be a good predictor of how well he would do if the order were
scrambled.



In the learning of "continuous discourse" materials, the programmer may
make considerable use of "prompts." In a prompted frame the student is
enabled to respond correctly on the basis of knowledge of syntactical
restraints, pat verbal associations, etc., for example, "Just as smoke rises,
warm air will also ______" (ref. p. 535). Such prompting techniques
are assumed to facilitate learning. It is important, however, to distinguish
between frames intended to promote learning and frames intended to see if
learning has taken place. The former might be called instructional frames
and the latter, criterion frames. The same prompting techniques which may
enhance learning on instructional frames should not allow response sets to
operate on criterion frames. The programmer will detect the more obvious
opportunities for response sets to operate by inspecting the criterion
frames.

* In fact, some devices are designed for paired-associate learning
material exclusively.

** The associations need not be one-to-one (see ref. 127).

For those frames in which numerical responses are to be given in a
multiple-choice format, a special response set may operate. Ebel (ref. 33)
notes the "...strong tendency for the examinees to confuse the absolute
value of the answer with the response position used to indicate it." This
tendency might be reduced (but probably not eliminated) by using letters
rather than numbers to indicate the response positions. Shay (ref. 120)
reports that some learners showed this tendency in going through his program
on Roman numerals, and unfortunately this was not corrected when the program
was pretested. Perhaps a color coding of the response alternatives would
have eliminated this confusion.

Establishing Time Limits and Length of Test (Program)

The question of how much time to allow for a test is, of course,
inseparable from the question of how long the test should be.
Administrative considerations usually serve to limit the amount of time
available for testing, and in this way indirectly limit the length of the
test. For a fixed amount of testing time, the test constructor tries to
provide a sufficient number of items to adequately sample the behavior in
which he is interested. If, however, he includes too many items, the test
may overemphasize speed of responding when the test is intended to measure
something else (ref. 135). When time permits and enough items are
available, the test constructor may add items to his test to increase the
precision of his measurement. Under the proper conditions, the amount of
increase in the test's reliability may then be predicted by means of the
Spearman-Brown formula (ref. 65).
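
As a minimal sketch of the prediction just mentioned: in its usual form
the Spearman-Brown formula gives the reliability of a test lengthened k
times as k*r / (1 + (k - 1)*r), where r is the reliability of the original
test. The function name and the numbers below are illustrative assumptions,
not values taken from the references, and the formula presumes that the
added items are comparable to the original ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability of a test whose length is multiplied by
    `length_factor` (2.0 means the test is doubled in length).

    Assumes the added items are comparable to the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


if __name__ == "__main__":
    # Hypothetical example: a test with reliability .60 is doubled in length.
    print(round(spearman_brown(0.60, 2.0), 2))
```

Under these assumptions, doubling a test whose reliability is .60 would be
expected to raise its reliability to .75.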

In programming, we cannot state any precise relationship between length
of program and time taken by learners. In the earlier section on limitations
during operational use, we saw that adding frames to a program may make each
frame easier to respond to, and in this way decrease the amount of time
needed to go through the program, while increasing the amount learned. It
does not seem reasonable, of course, that adding frames to a program will
always decrease the amount of time needed by the learners and increase the
amount learned. It may be, however, that whenever frames are added to a
program so that the time needed to go through the program decreases, this
is always accompanied by an increase in learning. If research supported
this hypothesized relationship, it would suggest that during tryout the
programmer should pay attention not only to whether frames are responded to
correctly but also to how much time is required to respond to each frame.
When a frame requires an unusually long time, this might indicate that
additional frames are needed prior to it.
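
The timing check suggested above might be sketched as follows. The data
layout (a mapping from frame number to the response times observed during
tryout) and the rule of flagging any frame whose median time exceeds twice
the typical frame median are assumptions made only for illustration; no
such threshold is given in the literature cited.

```python
from statistics import median


def flag_slow_frames(times_by_frame, ratio=2.0):
    """Return the frames whose median response time is unusually long.

    times_by_frame: dict mapping a frame number to a list of response
        times (in seconds) recorded for that frame during tryout.
    ratio: hypothetical cutoff; a frame is flagged when its median time
        exceeds `ratio` times the median of all frame medians.
    """
    frame_medians = {f: median(t) for f, t in times_by_frame.items()}
    typical = median(frame_medians.values())
    return sorted(f for f, m in frame_medians.items() if m > ratio * typical)


if __name__ == "__main__":
    # Made-up tryout data: frame 3 takes far longer than its neighbors.
    tryout = {
        1: [10, 12, 11],
        2: [9, 10, 11],
        3: [40, 35, 50],
        4: [12, 13, 11],
    }
    print(flag_slow_frames(tryout))
```

A flagged frame is only a candidate for inspection; whether additional
frames are actually needed before it remains a judgment for the programmer.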

An additional timing consideration can be resolved during tryout of a
program. There appear to be large individual differences in the amount of
time learners take to go through the same program. Rothkopf (ref. 115)
reports that a range of times needed on one program is from 23 to 60 hours;
Shay (ref. 120), from 31 to 176 minutes; and Gagné and Dick (ref. 52), from
190 to 380 minutes. Since the fastest learners may take only about one-fifth
to one-half as much time as the slowest learners, the programmer must
make some provision for occupying the time of those who finish first.
Tryout data can provide some idea of the range of times to be expected with
a particular population of learners using a particular program.

Selecting Items (Frames)

Finally, a major purpose of pretesting a test is to improve and select
items. The basic data the test constructor may get from pretesting items
are their difficulties and their correlations with other items and with an
outside criterion.

Item difficulty is used for two basic purposes. One is to select items,
and the other is to order the items selected in the final form of the test.
If the test constructor wants to make the maximum number of discriminations
among the people in the group tested, and the available items are completely
uncorrelated, then he selects items which have p values of .50.* If the
items were perfectly correlated, he would choose items spread over the
entire range of difficulty. If he wanted merely to divide the group into
two subgroups, he would choose items with p values which corresponded to the
proportion of people he wanted in each group. If, for example, he wanted the
higher scoring subgroup to constitute 20 percent of the population tested, he
would choose items with p values of .20. When he uses the item difficulty
information for ordering the items in the final form of the test, he generally
arranges them from easiest to most difficult.
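
The two uses of item difficulty described above can be illustrated with a
short sketch. It computes each item's p value as the proportion of
examinees attempting the item who answer it correctly, and then orders the
items from easiest to most difficult. The response layout (True, False, or
None for "not attempted") and the function names are assumptions introduced
for the example.

```python
def item_p_values(responses):
    """Return the p value of each item.

    responses: list of examinee records; each record is a list with one
        entry per item: True (correct), False (incorrect), or None
        (item not attempted).
    """
    n_items = len(responses[0])
    p_values = []
    for i in range(n_items):
        attempted = [row[i] for row in responses if row[i] is not None]
        if not attempted:
            raise ValueError(f"item {i} was attempted by no one")
        p_values.append(sum(attempted) / len(attempted))
    return p_values


def order_easiest_first(p_values):
    """Return item indices sorted from easiest (highest p) to hardest."""
    return sorted(range(len(p_values)), key=lambda i: p_values[i], reverse=True)


if __name__ == "__main__":
    # Made-up pretest data for four examinees and three items.
    data = [
        [True, False, True],
        [True, True, None],
        [True, False, False],
        [False, False, True],
    ]
    p = item_p_values(data)
    print(p, order_easiest_first(p))
```

Selecting items near a target difficulty (for example .50 or .20) then
amounts to keeping the items whose p values lie closest to that target.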

In programming, frame difficulty information obtained during a tryout
may also be very useful in selecting, rewriting, and, perhaps, reordering the
frames. Carr (ref. 15) has listed what he considers to be five possible
sources of error in program writing. They may be paraphrased as follows:
(1) Incorrectly specifying criterion behavior; (2) Incorrectly specifying
initially available behavior; (3) Providing an inadequate amount of training
material; (4) Improperly sequencing the material; and (5) Moving too
quickly. (It is not quite clear how this differs from the third source of
error.)

It would seem that each of these sources of error except for the first
could be revealed by pretest data on frame difficulty. Unfortunately, it
may be hard for the programmer to determine just which source, or sources,
of error is operating in a given situation.

A general procedure might be to periodically place in the program what
we have called criterion frames, that is, frames which allow the students
to demonstrate mastery of some particular aspect of the subject matter. The
programmer could obtain data on the difficulty of these frames outside of
the context of the program, and compare them with the difficulty of these
frames within the context of the program. If a criterion frame is easier
within the context of the program than outside it, then the programmer may
assume that the frames previous to it in the program contribute toward the
learning of what the criterion frame tests for. If the criterion frame is
as difficult within the context of the program as outside it, the frames
previous to it in the program may not be adequate.

In addition to item difficulty, the correlations of an item with other
items and with an outside criterion are other data which are obtained during
tryout and which can be useful in selecting items. We have already seen how
information about the intercorrelation of the items is used in conjunction
with information about item difficulty to maximize the number of
discriminations made by the test. Now we will see how item-test and
item-outside criterion correlations can be used by themselves.

* The p value of an item is the proportion of examinees attempting it
that answer it correctly.

When the test constructor wants his test to measure an ability or trait,
e.g., anxiety, then the correlations of each item with total test score
become important. Since each item is intended to measure the same
characteristic as every other item, the homogeneity of the items, as may be
measured by the correlation between each item and total test score,* becomes
a basis for item selection. On the other hand, the test constructor might be
interested in developing an achievement test to define some criterion
behavior in which he is interested, and the correlation of each item with
total test score may not be important to him. Finally, he might be
interested in predicting an outside criterion, without regard to the
"purity" of the test that will enable him to do so. In this case, he may see
how well each individual item discriminates between two groups of people who
are high and low on his criterion measure, and then select those items which
best make this discrimination. Thorndike (ref. 131, p. 232) points out that
the test homogeneity and individual item validity viewpoints are two
extremes; in actual practice both considerations may be of some importance
to the test constructor.
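
The two viewpoints can be made concrete with a small sketch. The first
function computes the ordinary product-moment correlation between each item
(scored 1 or 0) and total test score; the second compares the proportion
passing an item in groups high and low on an outside criterion. The
function names, the cutoff used to form the two groups, and the data are
assumptions for illustration. Note also that the total score used here
includes the item itself, which tends to inflate the correlation somewhat.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation; assumes neither variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_total_correlations(scores):
    """scores: one row per examinee, one 0/1 entry per item."""
    totals = [sum(row) for row in scores]
    n_items = len(scores[0])
    return [pearson([row[i] for row in scores], totals) for i in range(n_items)]


def high_low_difference(item_scores, criterion, cutoff):
    """Proportion passing the item among examinees at or above `cutoff`
    on the criterion, minus the proportion passing among those below it."""
    high = [s for s, c in zip(item_scores, criterion) if c >= cutoff]
    low = [s for s, c in zip(item_scores, criterion) if c < cutoff]
    return sum(high) / len(high) - sum(low) / len(low)


if __name__ == "__main__":
    scores = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]  # made-up data
    print([round(r, 3) for r in item_total_correlations(scores)])
    print(high_low_difference([1, 1, 0, 0], [10, 9, 3, 2], cutoff=5))
```

Items with high item-total correlations serve the homogeneity viewpoint;
items with large high-low differences serve the item validity viewpoint.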

Can some concepts analogous to those of test homogeneity and individual
item validity be useful to the programmer in revising his program?

When one sets out to construct a homogeneous test, then item-test
correlations are logically relevant as a basis for item selection. In
programming, if we consider each frame to be a test item, there is no
logical reason for using item-test correlations as a basis for selecting
items (frames). As an empirical matter, however, this may turn out to be a
useful procedure. Hosmer (ref. 72) used test items with high item-test
correlations as frames for a Crowder-type program. Jones' data (ref. 80), on
the other hand, showed a tendency for those items which correlated lower
with total test score to have higher instructional value. Jacobs (ref. 77)
has discussed some of the difficulties in interpreting Jones' results. We
obviously need some research on the usefulness to the programmer of a
concept analogous to that of test homogeneity; we will see now one direction
such research might take.**

While an individual test item might be judged by the discriminations it
makes, an individual program frame may be judged by its instructional
effectiveness. The basic paradigm for evaluating an individual frame might
be to construct two versions of a program, one including the frame to be
evaluated and one omitting it, and to compare the criterion performance of
otherwise comparable groups of learners given the two versions. As an
operational procedure, however, the application of this paradigm would be
extremely impractical with a program of even moderate length. The paradigm
might be used in a research study to obtain criterion measures of the
instructional effectiveness of a set of frames. We could then determine how
well item-test correlations could predict these measures of instructional
effectiveness.

* Other measures of test homogeneity, those of Loevinger and Guttman, are
discussed by Guilford (ref. 64, pp. 363-364).

** In a discussion in Section 4 of the application of Guttman scaling to
programming, we will see another direction such research might take.

As an alternative, one might construct a criterion test whose items can
be identified in one-to-one fashion with instructional frames or sets of
instructional frames in the program. The test would then be administered
both as a pre- and posttest. An instructional frame or set of frames which
failed to increase the proportion of learners getting the corresponding test
item right from pre- to posttest would be revised. Here the problem is
whether instructional and test items can actually be matched in a one-to-one
fashion without cross-contamination. For a continuous discourse or
structured subject matter, that is, one in which certain topics must
necessarily precede others, this matching may be impossible.
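
Where such a matching is possible, the pre- and posttest comparison might
be tabulated as in the following sketch. The item labels, the mapping from
test items to frames, and the minimum gain of zero are all hypothetical
choices made for the example.

```python
def frames_needing_revision(pre, post, item_to_frames, min_gain=0.0):
    """Flag frame sets whose matched test item shows no gain.

    pre, post: dicts mapping a test item to the proportion of learners
        answering it correctly on the pretest and on the posttest.
    item_to_frames: dict mapping each test item to the frame numbers
        assumed to teach it.
    min_gain: hypothetical threshold; gains at or below it are flagged.
    """
    flagged = {}
    for item, frames in item_to_frames.items():
        gain = post[item] - pre[item]
        if gain <= min_gain:
            flagged[item] = {"frames": frames, "gain": gain}
    return flagged


if __name__ == "__main__":
    pre = {"t1": 0.20, "t2": 0.50, "t3": 0.40}   # made-up proportions
    post = {"t1": 0.80, "t2": 0.50, "t3": 0.30}
    mapping = {"t1": [1, 2, 3], "t2": [4, 5], "t3": [6]}
    for item, info in frames_needing_revision(pre, post, mapping).items():
        print(item, info["frames"], round(info["gain"], 2))
```

In this made-up case the frames matched with items t2 and t3 would be
candidates for revision, since neither item shows a gain.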

Revising Items (Frames)

Both the test constructor and the programmer may wish to use the
information obtained in a tryout for revising items (frames), as well as for
selecting them. In both test construction and program development the
successful revision of items or frames may be pretty much of an art, which
means that a complete set of rules cannot be explicitly stated for this
activity. In test construction only two rules have been found, both of which
deal with the revision of multiple-choice items.

(1) Eliminate or revise alternatives which attract very few examinees.

(2) Eliminate or revise alternatives which fail to make the proper
discriminations. If the examinees who are highest on the criterion measure
choose a particular incorrect alternative more frequently than the examinees
who are lowest on the criterion measure, or if they choose the correct
alternative less frequently, then the alternative involved is not making the
proper discriminations (ref. 131, p. 256).

One might apply both these rules to programming. Rule (1) might not
be valid in a programming context, since an alternative which few people
choose could still conceivably serve some instructional function by its mere
presence among the alternatives. Rule (2) might be rephrased: If high
aptitude learners choose a particular incorrect alternative more frequently
than low aptitude learners, or if they choose the correct alternative less
frequently, then the frame is poor. In such instances there may be a subtle
ambiguity in the frame of which only the high aptitude learners are aware.
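
A tabulation of the kind these two rules call for might look like the
following sketch. The split of learners into high and low aptitude groups
is assumed to have been made beforehand, and the 5 percent figure used to
decide that an alternative "attracts very few" is an arbitrary assumption,
not a value from the sources cited.

```python
from collections import Counter


def review_alternatives(high_choices, low_choices, alternatives, correct,
                        min_share=0.05):
    """Apply rules (1) and (2) to one multiple-choice frame.

    high_choices, low_choices: alternatives chosen by the high and the
        low aptitude learners (one entry per learner).
    alternatives: all alternatives offered in the frame.
    correct: the keyed (correct) alternative.
    min_share: hypothetical cutoff for "attracts very few".
    """
    high, low = Counter(high_choices), Counter(low_choices)
    n_high, n_low = len(high_choices), len(low_choices)
    notes = {}
    for alt in alternatives:
        p_high, p_low = high[alt] / n_high, low[alt] / n_low
        share = (high[alt] + low[alt]) / (n_high + n_low)
        problems = []
        if alt != correct and share < min_share:
            problems.append("rarely chosen")  # rule (1)
        if alt == correct and p_high < p_low:
            problems.append("correct answer favored by low group")  # rule (2)
        if alt != correct and p_high > p_low:
            problems.append("distractor favored by high group")  # rule (2)
        if problems:
            notes[alt] = problems
    return notes


if __name__ == "__main__":
    high = list("AAAAABBBBC")  # made-up choices of ten high aptitude learners
    low = list("AAAAAABBCC")   # made-up choices of ten low aptitude learners
    print(review_alternatives(high, low, "ABCD", correct="A"))
```

As the text notes, a "rarely chosen" flag need not mean that the
alternative should be dropped from a program frame.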

Further Pretesting and Revision

The amount and kind of pretesting for a test will vary with the available
resources. Conrad (ref. 17) recommends a three-stage tryout procedure. The
first stage would be intended to reveal gross defects in the test and, as was
mentioned earlier, might use the test constructor's colleagues as examinees.
The second stage would be for the purpose of item selection and revision and
would utilize examinees from the population for which the test is intended.
The third stage would provide information on time limits and serve as a
"dress rehearsal."

In programming, the amount and kind of pretesting or tryout would also
be determined to some extent by the available resources. But in programming
a multiple-stage tryout and revision may have much greater importance than
in testing. If the test constructor starts with a sufficiently large number
of items, he can discard those which fail to make the desired discriminations
and retain those which make the desired discriminations for operational use.
In programming, however, when a frame fails to teach what it is intended to
teach, it, or perhaps earlier frames, must be revised, replaced, or reordered.
The version of the program which emerges from this revision must then be
tried out, and this process may have to be repeated many times.

Evaluation

When the final form of a test becomes available after pretesting and
revision are completed, the next step is to evaluate it. Since tests are
used to classify people, when we evaluate a test we try to find out how well
it classifies people. Then, before putting the test into operational use, we
compare how well the test classifies people with how well the best available
alternate procedure classifies people. Similarly, in evaluating a program,
we want to know how well it produces the desired criterion behavior and how
well it compares with the best available alternate training method in
producing the desired behavior. As we shall see, there are many
considerations in evaluating a test which also apply to evaluating a program.

The specific way in which we determine how well a test classifies people
depends on our purpose in classifying them. In testing educational
achievement, we are interested in what the test constructor calls content
validity, that is, "how well the content of the test samples the class of
situations or subject matter about which conclusions are to be drawn"
(ref. 3, p. 13). Content validity is determined by comparing the content of
the test with the content of the instructional or training course and/or the
statement of objectives for the course. The test items should not only be
derived from the course objectives but also should adequately sample the
range of tasks for which the training was intended. A common mistake in
preparing a test of educational achievement is to include items which indeed
present the examinee with tasks for which the training was intended, but
which

"...limit the test series to the elements of the criterion
series that are most conveniently and most easily reproduced,
or most easily and objectively observed and evaluated...
(so that)...many of the more unmanageable but more
important and crucial elements tend to be neglected in, or
omitted from, the test" (ref. 92, p. 153).

Since we use an achievement test to evaluate a program, the above
considerations concerning the content validity of an achievement test are
quite relevant. Many programs may be intended not just to provide the learner
with certain "terminal" skills but rather to serve as a basis for learning
more advanced subjects. In these cases many problems arise in the proper
measuring of "achievement." It may be relatively easy to test what behaviors
the learner who has gone through the program can now perform, but this may
not be related to how he will learn new material. Kendler (ref. 83) and
Gagné (ref. 51) have discussed the problem of measuring how well the learner
who has gone through a program can deal with the range of situations in which
the programmer is interested. In the terms which we will discuss next, an
achievement test which can provide this measurement is said to have
predictive validity.

In contrast with content validity, in which performance on the test may
be of interest in itself, there are three other types of test validity which
are established by relating test scores to criterion scores, namely,
predictive, concurrent, and construct validity. Predictive validity refers to
how well a test can predict future performance. Concurrent validity refers
to how well a test can discriminate among presently identifiable groups.
The test constructor who attempts to establish either predictive or concurrent
validity faces those problems of collecting criterion data that were mentioned
in the earlier section on specifying objectives. Construct validity refers
to how well the test measures some trait or quality (construct) which is
presumed to be reflected in test performance. The test constructor attempts to
establish construct validity by hypothesizing and verifying certain relations
between the test and other variables.*

In trying to establish predictive, concurrent, or construct validity,
the sample of examinees cannot be as haphazardly assembled as in certain
stages of pretesting and tryout. The sample of examinees should be
representative in abilities of the population for which the test is intended,
and, as far as is possible, representative in motivation as well. This point
would seem to apply directly to the evaluation of programs, also.

The test constructor may also be interested in "face validity": whether
a test looks like it will do the job for which it is intended.** This
concept may also be important to the programmer: if the program does not look
like it will do the job for which it is intended, the learners may simply
refuse to go through it. We do not yet know what characteristics a program
must have in order to possess face validity; for a sampling of some students'
reactions, see Roe (ref. 113).

Earlier, in our discussion of insuring the adequacy of a criterion, we
mentioned the consideration of criterion reliability, or consistency of
criterion measurement. Consistency of measurement is, of course, also
desirable in tests which are used for prediction, as well as in tests which
are used as criteria, and so test reliability may be looked for in evaluating
a test. Thorndike, however, suggests that

"If anything, the significance of reliability has been
overestimated. It must be remembered that precision in
a measurement procedure is a necessary condition only
for obtaining significant relations between different
measures and is not an end in itself" (ref. 131,
pp. 104-105).

The programmer may generally want the changes in behavior his program
brings about to be lasting rather than temporary, and he might speak of this
characteristic as in some way analogous to "test-retest" reliability.
Unfortunately, the available knowledge of test reliability does not suggest
any techniques for the programmer to use in order to promote the retention
of what his program teaches.

* For more extensive discussion of construct validity, see Cronbach and
Meehl (ref. 24) and the APA Technical Recommendations (ref. 37).

** For further discussion of face validity, see Mosier (ref. 101).



Relative Costs and Benefits 

We have discussed some considerations in evaluating how well a test
classifies people and the implications of these considerations for
evaluating how well a program produces the behavior desired. Now we will look
at the question of costs of tests and programs and at comparisons with the
best available alternatives to tests and programs.

When the test constructor has gathered validity data on his test, he
must then reach a decision as to whether it is profitable to put the test
into operational use. For a selection test, this decision can be made by
considering the validity coefficient of the test (the correlation
coefficient between test scores and criterion scores), the relative number of
people to be tested and positions to be filled (the selection ratio), and
the cost of administering and scoring the test. The higher the validity
coefficient, and the lower the cost of the test, the greater the benefit
to the test constructor using the test. The selection ratio has a more
complicated relation to the benefit obtained through use of the test (see
ref. 23, pp. 36-37). When information on these variables becomes available,
it can be combined according to formulas given by Cronbach and Gleser. The
basis of combining what may be rather diverse measures is a cost analysis.
Cost analyses were previously discussed in the section on selecting criteria.

When a measure of the benefits due to the use of the test is computed,
it should then be compared with a measure of the benefits due to using the
best available alternate selection procedure. The best available alternative
may be to use some already available piece of information (e.g., highest
grade of school completed, interviewer's impression of applicant, etc.),
or it may be merely to randomly select applicants. In any event, the test
should be put into operational use to replace or to be used in conjunction
with the best available alternative only to the extent that doing so makes
a distinct contribution to the test constructor's goals.

In programming, there are also a variety of diverse elements which must
be combined to get a measure of the benefits of using a program. We have
already seen that there may be several different criterion measures of
posttest performance (e.g., rate of performance, quality of performance) which
must be combined to yield a single criterion measure for each individual.
Criterion measures must further be combined with certain items of cost to
arrive at a measure of gain to be expected through the use of the program.
Two major items of cost may be learning time and the expense of preparing a
program. Ferster and Sapon have stated:

"...a series of materials could probably be constructed
in which each item is scientifically designed so that
the students will progress from a zero knowledge of
German to a complicated repertory of the level of a year
of college German without ever having made an error"
(ref. 42, p. 185).

While this may be so, the programmer will want to know such things as how
long it would take to go through such a program and how expensive such a
program would be to develop.

Rigney and Fry (ref. 111) have outlined the following items which they
feel should enter into a cost evaluation:



1. Cost per unit
   a. Per program
   b. Per student
   c. Per machine
2. Investment
   a. Initial
   b. Long term
3. Training time per student
4. Quality of students required (aptitude, experience, etc.)
5. Quality of instructors required (credentials, experience, etc.)
6. Logistics involved
   a. Space, power, maintenance requirements
   b. Program reusability, useful life
7. Practical effectiveness of method
   a. In relation to training objectives
   b. In relation to competing methods
8. Acceptance of method
   a. By students
   b. By instructors
   c. By administrators

An additional cost consideration in many educational, industrial, and
military settings is how quickly the subject matter may be expected to
become obsolete and how expensive it would be to make changes in the
program to cope with this obsolescence.

As in evaluating a test, the basis of combining rather diverse measures
in order to evaluate a program is a cost analysis. Kershaw and McKean
(ref. 84), although they do not deal explicitly with programmed instruction,
present a detailed discussion (with hypothetical examples) of the application
of cost accounting procedures to an educational system.* The decision as to
whether a program should be put into operational use may be made by
comparing the net gain to be expected from using the program with the net
gain to be expected when the best available alternate training method is
used. The best available alternative may be the use of another program,
unguided self-study from books, or, perhaps, no training at all. The best
alternative most commonly available at present is probably "conventional"
or "traditional" instruction (e.g., lectures, recitation classes).

* Such cost analyses, in addition to providing a basis for evaluating
programs, may also suggest research designed to reduce costs. Rothkopf
(ref. 114), for example, compared two methods of dropping items in the
learning of paired-associates, and found that they did not differ in trials
to learn or in amount retained per trial to learn, although one method was
presumed to involve more expensive equipment.

Two major methodological problems arise in comparing a program to be
evaluated with the best available alternate training method. One basic
paradigm for comparing any two instructional methods is to use one of the
methods with one group of students and the other method with a "comparable"
group of students and to compare the achievement of the two groups.
Conceptually, "comparability" of groups means that the conclusions reached
would be the same no matter which group was assigned to which instructional
method. Operationally, one tries to obtain comparability by either a random
assignment of individuals to methods or by a random assignment of intact
groups (classes) to methods. This type of procedure, however, may often not
be administratively feasible, and has not always been used in studies of
programmed instruction. In some of the pioneering work of Pressey and his
students (e.g., refs. 11$, 126), as well as more recent work (e.g., ref. 73),
the experimenters have resorted to nonrandom assignments. Although they may
demonstrate that the groups used did not differ initially on mean aptitude
or pretest scores, this may or may not indicate comparability. The
evaluation of a program must be based upon experimental comparisons of
comparable groups, so that any differences or lack of differences in
achievement may be ascribed to differences in programmed and conventional
instruction rather than to pre-existing differences in groups.
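
The random assignment described above is simple to carry out in practice,
as the following sketch suggests. The units assigned may be individual
learners or intact classes; the method labels, the unit names, and the use
of a fixed seed (so that the assignment can be reproduced and audited) are
assumptions of the example.

```python
import random


def assign_to_methods(units, methods=("programmed", "conventional"), seed=None):
    """Randomly assign units (learners or intact classes) to methods.

    The units are shuffled and then dealt out in turn, so the groups
    differ in size by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    assignment = {m: [] for m in methods}
    for i, unit in enumerate(shuffled):
        assignment[methods[i % len(methods)]].append(unit)
    return assignment


if __name__ == "__main__":
    classes = [f"class_{n}" for n in range(1, 11)]  # hypothetical intact classes
    for method, group in assign_to_methods(classes, seed=1).items():
        print(method, group)
```

Random assignment does not guarantee that the groups are equal on any
particular measure; it only ensures that pre-existing differences are not
systematically related to the method received.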

Another methodological difficulty which comes up in comparing programmed
and conventional instruction is this: while programmed instruction may be
"standardized" (that is, the material presented to the student does not
depend on what class, school, or city he is in), conventional instruction is
not standardized. We know that teachers differ markedly in what they do in
the classroom, although we know little about teacher differences in the
effectiveness of what they do (ref. 100). For this reason, the programmer
who wants to compare programmed and conventional instruction must in some
way sample the variety of conventional instruction available so that he may
have greater confidence that the results he finds will apply to his
particular situation. Some programmers and research workers have raised the
question as to whether the same program which is useful with "dull" students
is also useful with "bright" students. The programmer should also be
interested in knowing whether the same program which is useful when "poor"
conventional instruction is available is also useful when "good" conventional
instruction is available.

Providing Information to Test (Program) Users 

In order to use test scores, one needs to know, of course, how the
test scores are distributed among the members of a relevant population of
examinees. The test user who wants to compare an individual's or group's
test scores with those of other individuals or groups can often make a more
useful comparison by referring to score distributions from rather specific
reference groups. A high school principal, for example, may find it more
useful to know how the achievement of his ninth-grade students on a
"standardized" mathematics test compares with the achievement of other
ninth-grade students in cities with populations of less than 25,000 in the
Mid-west, rather than how it compares with a reference group randomly drawn
from around the country. For this reason, the test constructor may make
available score distributions for various sub-populations. Similarly, the
potential user of a program may find it more useful if the learning times and
posttest scores are broken down for various sub-populations.

A second way in which the test constructor might present data, so as
to make his test more useful to others, is to establish the equivalence
of scores from his new test with scores from other tests of similar use.
For example, a potential test user may know that 60 is a useful cutting
score for his purposes when the test given is Mathematical Aptitude Test A.
If, however, Mathematical Aptitude Test B is the test from which he is given
applicants' scores, how can he tell what will be a useful cutting score?

This type of problem arises in institutional testing programs, such as
are carried out by the College Board, in which the test items used on
successive forms of the tests must be continually changed. The problem differs
from the one of providing detailed data on the performance of sub-populations
on a particular test in that here one has to deal with a new test given to a
new (and potentially different) population.

A basic mechanism in determining or producing "equivalent" scores is
to have some overlapping items common to both of two forms to be equated,
so that both examinee populations have a common "core" of items. Dyer and
King (ref. 32, pp. 101-104) give more details on this procedure as it is
carried out by the College Board, and Flanagan (ref. h^) also discusses a
number of ways of obtaining comparable or equivalent scores.

The potential user of a program will often find that his intended
population is not the same as the population used in evaluating the program.
He may specifically want to know up to what level of proficiency the program
will bring his learners and how long it will take them to complete the
program. The technique of equating test forms discussed above may suggest
a procedure for estimating the values of these two variables. While a basic
mechanism in test equating is to provide some overlap in the test items
given to the two examinee populations, a useful analog of this technique
in programming might be to obtain scores for the potential learner
population and for the population used in evaluating the program on the same
tests, namely, those tests which predict time to learn and posttest scores
in the latter population.

Can such tests be found? Carr has stated:

"One might hypothesize that effective instructional
devices might wipe out differences in achievement
measures associated with intelligence or aptitude
test performance. The findings of a number of
experiments seem to support this hypothesis" (ref. 15, p. 561).

He goes on to cite the studies of Porter (ref. 105), Irion and Briggs
(ref. 75), and Ferster and Sapon (ref. 42).

The evidence from the studies Carr cites is not too clear-cut. In
Ferster and Sapon's study, for example, only six out of the 28 students
who started the course finished it. It may be that the aptitude test
which could not predict the ordering of the achievement test scores of
the six could have predicted which students would drop out. Furthermore,
the studies of numerous other investigators (e.g., refs. 20, 59, fi2, 99, 120)
have since shown that aptitude and pretest measures could be used to predict
time to go through a program and/or achievement on a posttest. The
correlations reported have generally ranged between .30 and .50. Since the
potential user of a program may be interested in predicting group means of
time to learn and posttest scores, such correlations may indeed be adequate
for his purpose.

The success of the proposed procedure will depend on the extent to
which the basic assumption of homogeneity of regression is met, that is, the
extent to which an aptitude or pretest measure, which correlates with
achievement or time to go through the program in the evaluation group of
learners, shows the same correlation in the new group of learners in which
the potential program user is interested.
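
The estimation procedure suggested above can be sketched in a few lines.
The sketch is illustrative only: the scores are invented, the function names
are our own, and a single pretest with a straight-line regression is assumed.
It fits the regression of posttest score on pretest score in the evaluation
group and then, assuming homogeneity of regression, applies the same line to
the mean pretest score of the new group.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_group_mean(slope, intercept, new_scores):
    """Predicted mean criterion score for a new group.

    Valid only if the regression found in the evaluation group
    also holds in the new group (homogeneity of regression).
    """
    return intercept + slope * mean(new_scores)


# Hypothetical evaluation-group data (invented for illustration).
eval_pretest = [40, 50, 60, 70, 80]
eval_posttest = [52, 57, 62, 67, 72]

slope, intercept = fit_line(eval_pretest, eval_posttest)

# Hypothetical pretest scores of the potential user's own learners.
new_pretest = [30, 40, 50]
predicted_mean = predict_group_mean(slope, intercept, new_pretest)
```

The same two functions could be applied to time to learn instead of posttest
score. With correlations of the size reported above, only the predicted group
mean, and not the prediction for any one learner, deserves much weight.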

Section Some Selected Relationships Between
Testing and Programming

In this section we will deal with two topics which do not usually get
much attention in testing but which are quite important in programming:
the ordering of frames within a program and the assigning of different
learners to different sequences of material. We will explore the extent
to which testing considerations may prove useful for each of these topics.

Item (Frame) Ordering

In testing, the ordering of the items is considered only for
"motivational" purposes. In general, the test constructor tries to arrange the
items in ascending order of difficulty, that is, from easiest to hardest.*
If the reverse order is followed, a lower test score may result (ref. 96).
While the test constructor may be concerned with how the items are ordered
and with the general difficulty level of the items around a given item
(ref. 67), these do not appear to be "cognitive" variables, that is, an
ascending order of difficulty may facilitate getting the harder items right,
but without giving any aids to answering specific items. This is because
the test constructor, in choosing items for the test, has followed the rule
that "if an item depends in any way upon the preceding one, neither must
reveal the answer to the other" (ref. 7, p. 63).

In programming, however, it is commonly believed that there should be
a hierarchical relationship between each frame and the next: "At each step
the programmer must ask 'what behavior must the student have before he can
take this step?' A sequence of steps forms a progression from the
initially assumed knowledge up to the specified final repertoire. No step
should be encountered before the student has mastered everything needed to
take it" (ref. 125, p. 154).

There is little evidence available on this point. Gavurin and Donahue
(ref. 5**) compared a "logical" with a random sequence of frames, but it is
not clear in what way the "logical" sequence was "logical," or how other
programmers can provide "logical" sequences for their subject matters. Roe
(ref. 113, p. 13) mentions the following anecdote concerning the effects of
order of frames on learning:

"One student, who failed to read the instructions at the
beginning of the programmed textbook, read down the page
instead of from page to page with the result that the
sequence of items he saw were numbered: 1, 40, 79, 118,
157; 2, 41, 80, 119, 158; 3, 42, 81, 120, 159; and so on.
This student still managed to get a high score on the
criterion test."

While the anecdote is certainly amusing, one wonders whether the learner would
have received an even higher score on the criterion test if he had read the
instructions. We certainly cannot conclude that the ordering of frames is
unimportant.

* The test constructor will, of course, arrange together those items which
depend on the same reading passage, diagram, etc., and also group together
those items which are in the same format.

If we grant that the programmer should ask what behavior a student must
have before he can take each step, how can the programmer answer this
question? In many cases the programmer can resort to a detailed task
analysis. If the programmer is attempting to teach "dividing fractions,"
he must obviously be concerned with the subordinate goal of "multiplying
fractions." If he is attempting to teach long division, he must first be
concerned with teaching addition, subtraction, and multiplication. In such
cases the learning of some skills must precede the learning of others
because the skills to be learned first are component parts of the skills to be
learned later.

In other cases, however, there may be no such part-whole relationships
among the subskills, or, if there are some, they may not be immediately
apparent. Suppose, for example, one is learning to drive a car. Consider
the subskills of maneuvering in traffic and parking. Must one of these
skills be taught before the other, and, if so, which should be taught first?
We will examine whether testing can contribute toward the answering of such
questions. But before we can proceed to explore this possibility, it is
necessary to know something about a type of measuring instrument called a
Guttman scale.*



With many tests, when we are told only that a given person gets 7 of
the 10 items correct,** we cannot say which of the 7 items they were. Or,
if we are told that each of two people got 7 items right, we cannot say
whether they both got the same 7 items right. If, however, the items in the
test form a perfect Guttman scale, then we could, when told how many items
a person got right, say just which items they were.

What would such a test look like? We can diagram a generalized scheme
of the possible different ways in which people could respond to the items in
a Guttman scale. For convenience, let us consider a Guttman scale containing
only 4 items. In the diagram below we will let a "1" mean that the person
gets the item right and a "0" mean that the person gets the item wrong.



* For more information on Guttman scaling, see Guttman (ref. 66),
Edwards (ref. 3^), Torgerson (ref. 133), Riley, Riley & Toby (ref. 112),
Green (ref. 61).

** The Guttman scale was developed in the field of attitude measurement
in which the terms "get an item right" and "get an item wrong" are replaced
by "endorse a statement" and "fail to endorse a statement."

                         Items
Response                                  Total Number
Patterns     (1)    (2)    (3)    (4)        Right

   A          1      1      1      1           4
   B          1      1      1      0           3
   C          1      1      0      0           2
   D          1      0      0      0           1
   E          0      0      0      0           0

The items might conceivably be:

(1) 3 + 7 = ?
(2) 8 x 6 = ?
(3) 130 ÷ 5 = ?
(4) 1041 ÷ 26 = ?

If items (1), (2), (3) and (4) form a perfect Guttman scale, then a
person answering these items must fall into one of the Response Patterns
A, B, C, D, or E: if he gets only one item right, it must be item (1);
if he gets two items right, they must be items (1) and (2); etc. In general,
in a perfect Guttman scale one can reconstruct perfectly from a person's
total score exactly which items were gotten right.
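
The defining property of a perfect Guttman scale can be stated as a short
check. The sketch below is our own illustration, with invented function
names; it assumes the items are already listed from easiest to hardest and
that each response is scored 1 or 0. It rebuilds a response pattern from a
total score and tests whether a set of observed patterns fits the scale.

```python
def pattern_from_total(total, n_items):
    """In a perfect Guttman scale a total of k means the k easiest items were right."""
    return [1] * total + [0] * (n_items - total)


def is_perfect_guttman(responses):
    """True if every row is a run of 1's followed by a run of 0's.

    Each row lists one person's item scores, easiest item first.
    """
    return all(
        row == pattern_from_total(sum(row), len(row)) for row in responses
    )


# The five response patterns of the four-item diagram above.
PATTERNS = {
    "A": [1, 1, 1, 1],
    "B": [1, 1, 1, 0],
    "C": [1, 1, 0, 0],
    "D": [1, 0, 0, 0],
    "E": [0, 0, 0, 0],
}

scale_holds = is_perfect_guttman(list(PATTERNS.values()))

# A person who misses item (1) but gets item (2) right breaks the scale.
scale_broken = not is_perfect_guttman([[0, 1, 0, 0]])
```

Real data seldom scale perfectly, so in practice some tolerance for
deviant patterns would have to be built in; the sketch covers only the
perfect case discussed here.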

Now let us return to see how Guttman scaling may be related to a
decision as to whether either maneuvering in traffic or parking must precede
the other in a training sequence. In most, if not all, training situations
the trainees do not start off with absolutely no background, with no partial
knowledge of what is to be learned. We saw in the previous chapter that it
is the programmer's job in assessing the available resources to find out
just what relevant knowledge and abilities the trainees start with. Suppose,
then, that a person in charge of training people to drive automobiles tests
each of a large group of trainees on their initial ability to maneuver in
traffic and on their ability to park. Suppose further that each trainee is
scored pass or fail, 1 or 0, on each of these abilities, and that each of
the trainees is found to fit into one of the Response Patterns A, B, or C
shown below.

Response      Maneuver in
Patterns        Traffic        Park

   A               1             1
   B               1             0
   C               0             0

We notice that some people can both maneuver in traffic and park
(Response Pattern A), some people can neither maneuver in traffic nor
park (Response Pattern C), and some people can maneuver in traffic but
cannot park (Response Pattern B), but that no one can park without
being able to maneuver in traffic. Ability to maneuver in traffic and
ability to park, therefore, form a two-item Guttman scale. What can the
person in charge of training validly conclude from this?

It may be quite tempting to conclude that in learning how to drive a
car, one must learn how to maneuver in traffic before one learns how to
park, but it is not legitimate to conclude this. The fact that "maneuvering
in traffic" and "parking" may form a Guttman scale might indeed reflect
something about the "inherent structure" of the learning-to-drive subject
matter. On the other hand, it might merely reflect the prior learning
history of the trainees, that is, it might reflect what has preceded what,
not what must precede what. It might be that driving instructors always
teach how to maneuver in traffic before they teach how to park, and that
people of type B are people who discontinued some prior training after
learning how to maneuver in traffic but before learning how to park. Or,
people of type B might be experienced drivers who are used to diagonal
parking, and the test may have called for parallel parking. If either or
both of these explanations were correct, it would not necessarily follow
that people of type C, who can neither maneuver in traffic nor park, must
be taught how to maneuver in traffic before they are taught how to park.

Of what value, then, is information on whether certain subskills form
a Guttman scale to the programmer? We have seen that he cannot use such
information to prescribe a necessary ordering of a set of tasks which form
a Guttman scale for trainees who initially possess none of these skills.
Such information can be useful if the programmer wants to arrange training
on various subskills into a given sequence and then allow different trainees
to enter this sequence at different points. If, for example, a programmer
finds that a set of subskills form a Guttman scale, he may sequence training
on these subskills according to how they are ordered on the scale. What a
trainee can initially do would be represented by a string of 1's followed by
a string of 0's, and he would begin training on the subskill represented by
the first 0. The potential economic advantage of this procedure would be
that each trainee would not waste time in being taught to do what he can
already do, while the ordering of training tasks into a single sequence would
greatly simplify administrative matters.*
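
The entry rule just described can be illustrated with a small sketch. The
function names and the six-subskill profiles are hypothetical; subskills are
assumed to be listed in scale order and are numbered from 0, as is usual in
Python. The first function finds where a trainee would enter the single
training sequence; the second lists every subskill on which he still needs
training, which shows why scalability matters.

```python
def entry_point(profile):
    """Position of the first 0 in a scale-ordered pass/fail profile.

    Returns None when the trainee already passes every subskill.
    """
    for position, passed in enumerate(profile):
        if not passed:
            return position
    return None


def lessons_needed(profile):
    """Positions of all subskills the trainee has not yet mastered."""
    return [position for position, passed in enumerate(profile) if not passed]


# A trainee whose profile fits the scale: three 1's, then three 0's.
scaled_trainee = [1, 1, 1, 0, 0, 0]
start = entry_point(scaled_trainee)          # enters at position 3
remaining = lessons_needed(scaled_trainee)   # positions 3, 4, 5 (consecutive)

# A trainee whose profile does not fit any scale ordering of these subskills.
unscaled_trainee = [0, 1, 1, 0, 1, 0]
scattered = lessons_needed(unscaled_trainee)  # positions 0, 3, 5 (not consecutive)
```

When the profile fits the scale, everything from the entry point onward is
needed, so one fixed sequence serves every trainee. When it does not, the
needed subskills are scattered through the sequence.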

Whether such a procedure would actually pay off would depend on the
relative costs of (1) determining the scalability of subskills, (2)
determining each trainee's place on the scale, and (3) training time. It would
also depend, of course, on whether a set of subskills which scale were found.

* If, for example, six subskills formed the scale, and a trainee needed
instruction on only three of them, they would be the last three in the
training sequence. If the subskills did not form a Guttman scale, they
might be the first, fourth and sixth subskills. If the programmer
rearranged the training sequence so as to make these subskills consecutive for
this trainee, in doing so he may destroy the consecutiveness for another
trainee.

On this last point, Schultz and Siegel (ref. 118, p. 142) have recently
found that "Check lists for use in evaluating task performance in several
related naval job specialties (ratings)...meet the...Guttman scalability
requirements." In their work they did not test each person for various
subskills, but rather asked his supervisor whether he was "checked out"
on each skill. Research is needed to find out if actual performance tests,
rather than ratings, would yield the same result; if not, the scalability
of the subskills may only be in the perception of the supervisors doing
the rating.

Looking ahead to still later developments, we may find some situations
in which the subskills do not at first appear to form a Guttman scale, but
when the population of trainees is subdivided into two or more
subpopulations, the subskills form different Guttman scales for each subpopulation.

Suppose, for example, in one school system the "topics" in French
classes were taught in the order: listening, speaking, reading, grammar,
writing, and French civilization, and in another school system they were
taught in the order: French civilization, grammar, reading, writing,
listening, and speaking. If the programmer in charge of increasing the "knowledge
of French" of students who come from these two school systems (and who may
have ended formal instruction at different stages within each school) tests
them on the various subskills, the subskills would not appear to form a
Guttman scale. He may therefore feel that in order to avoid teaching students
what they already know, he would need to use many different sequences of
material. If, however, he analyzed the test data from students coming from
each school system separately, he would find for each school system that the
subskills did form a Guttman scale. He could then use this information by
providing two different sequences of instructional material and permitting
students to enter the appropriate sequence at the appropriate point.
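
The French example can be worked through as a check on this reasoning. The
sketch below is hypothetical throughout: it invents one student for each
possible stopping point in each school system, assumes a student passes
exactly the topics taught before he stopped, and uses our own function
names. Topics are ranked by how many students pass them, and each student's
scores are then tested for the run of 1's followed by 0's that a perfect
scale requires.

```python
SCHOOL_1_ORDER = ["listening", "speaking", "reading",
                  "grammar", "writing", "civilization"]
SCHOOL_2_ORDER = ["civilization", "grammar", "reading",
                  "writing", "listening", "speaking"]


def profiles(teaching_order):
    """One hypothetical student per stopping point in the teaching order."""
    n = len(teaching_order)
    return [
        {topic: int(i < stopped_after) for i, topic in enumerate(teaching_order)}
        for stopped_after in range(n + 1)
    ]


def forms_guttman_scale(students):
    """True if some ordering of the topics gives every student 1's then 0's.

    Ranking topics from most to least often passed is sufficient here:
    if any ordering yields a perfect scale, this one does.
    """
    topics = list(students[0])
    ranked = sorted(topics, key=lambda t: -sum(s[t] for s in students))
    for student in students:
        row = [student[t] for t in ranked]
        passed = sum(row)
        if row != [1] * passed + [0] * (len(row) - passed):
            return False
    return True


school_1 = profiles(SCHOOL_1_ORDER)
school_2 = profiles(SCHOOL_2_ORDER)

pooled_scales = forms_guttman_scale(school_1 + school_2)
school_1_scales = forms_guttman_scale(school_1)
school_2_scales = forms_guttman_scale(school_2)
```

Under these assumptions the pooled data fail to scale while each school
system taken separately scales perfectly, which is the situation the
paragraph above describes.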

In applying Guttman scaling to programming, what units of analysis should
the programmer use? It probably would be more profitable to lump together
a set of criterion frames which all deal with the same subskill, score each
trainee dichotomously "pass" or "fail" on the basis of his responses to the
set of frames, and see whether such subskills form a Guttman scale than to
attempt to scale individual criterion frames. Using the subskill as the
unit of analysis will reduce the data to a more manageable amount and
increase the reliability of the measures used.

We started by discussing how testing might help in discovering what
training material must precede what other material, in the sense of being
necessary for the learner to benefit from the later material. We have
shifted our emphasis to the question of how testing might help in determining
what training material might best precede what other material, in the sense
of increasing the efficiency of learning.

If there is a set of distinct subskills to be taught, and, by their
nature, each of them could precede each of the others, the direct approach
to determining the best way to sequence them* would be to try out all
possible orderings of the subskills with comparable groups of learners.
With as few as six subskills, however, there are 720 possible orderings,
and Gagné and Dick (ref. ^2) have isolated as many as 21 subskills in the
relatively limited major skill of solving simple linear equations. It
appears, then, that the direct approach to the optimal ordering of
subskills will not usually be feasible.

* One might object to the notion that there is a single best ordering
for all trainees. We will disregard this complication here.
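
The figure of 720 is simply the number of permutations of six things, and
the same arithmetic shows how quickly the direct approach gets out of hand.
The short sketch below uses only the standard factorial function; the
variable names are ours.

```python
import math

# Number of distinct orderings of n subskills is n factorial.
orderings_of_6 = math.factorial(6)    # 720, the figure given in the text
orderings_of_21 = math.factorial(21)  # roughly 5.1 x 10**19

# Each ordering would need its own comparable group of learners,
# so the number of groups required grows at the same rate.
groups_needed_for_6 = orderings_of_6
```

Even at six subskills the required number of comparable groups is
impractical; at 21 it is out of the question.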

Jones (ref. 79) has worked on the use of simplex theory as a more
feasible approach to the problem. Basically he postulates that the pattern
of intercorrelations among scores obtained on subskills can suggest how
training on these subskills should be sequenced. Jones cites one instance
(pp. 90-91) in which he feels this approach paid off. Rao (ref. 108,
p. 252), however, suggests that in such a procedure "...the particular
conclusions reached for the best sequence of such a training program may
well be drawn from actual experience with the problem at hand and not
from the loose theory offered." There appears to be a need for an explicit
statement of how the theory should be coordinated with observed events, as
well as a demonstration of its alleged utility in determining the optimal
sequencing of training on subskills. Specifically, there is a need for:

(1) Evidence in a wide variety of training situations that simplicial
forms result from the particular abilities needed for various subskills,
rather than from the order in which the subskills are taught when data are
collected.

(2) Evidence in a wide variety of training situations that changes
in the ordering of subskill training which are suggested by simplicial
analysis result in more efficient learning. In discussing the instance in
which a recommendation based upon simplex considerations was carried out and
in which evaluation showed "generally facilitative" effects, Jones (ref. 79,
p. 91) states:

"In this particular instance, therefore, simplicial analysis
would have recommended the same course of action without
either the expense or the delay of an experimental study.
If this result can be generalized, even if only a little
bit, the uses of simplex theory in curricula development
are very real indeed."

By "generalized, even if only a little bit" Jones seems to mean "generalized,
sometimes validly and sometimes not." Unfortunately, unless one can
accurately predict when recommendations from a simplicial analysis will or will
not be valid, such recommendations will remain of unknown usefulness.

Adaptive Programming

In the previous section on item ordering we saw that under certain
circumstances Guttman scaling could be quite useful to the programmer.
These circumstances included the condition that different learners could
enter the training sequence at different points. The purpose of this was
to take advantage of the individual differences that initially exist among
the learners; they should not be taught what they already know. In this
section we will explore other ways in which the programmer might take
advantage of individual differences among the learners, and see what testing
considerations might be relevant.

In general, the programmer may try to capitalize on individual
differences among the learners by not presenting all learners with the same
sequence of instructional material, but rather by giving different learners
different sequences of material which he feels are especially appropriate
for them. In order to do this he must somehow make distinctions among the
learners, either before or during training, or both. We will refer to the
procedure of differentiating among the learners in order to assign them to
different sequences of material as "adaptive programming." As we shall see,
there has been some work done in the testing area on "adaptive testing,"
that is, testing in which the examinees are provided with different sequences
of test items on the basis of their responses to prior test items. It is
not, however, because of this work that testing considerations are relevant
to adaptive programming. Rather, it is because in adaptive programming the
programmer must make some measurement of the learners in order to
assign them to different sequences of instructional material. If measurement
is made, then measurement (testing) considerations apply.

Perhaps the most important testing considerations that apply to adaptive
programming relate to validity. As we examine various types of adaptive
programming we will ask in each case how the costs and benefits of using
adaptive programming compare with the costs and benefits of not using
adaptive programming.

While the purpose of using adaptive programming rather than linear
programming (in which every learner gets the same sequence of material) is to
increase training efficiency, this purpose may not always be realized. We
have already seen that whether Guttman scaling information is useful will
depend on the relative costs of differentiating learners and of instructional
time. We must also be concerned with the validity of the test used to
differentiate learners. When the test lacks sufficient validity, we cannot
expect adaptive programming to pay off. Cronbach has said "The person who
attempts to differentiate individuals on inadequate data introduces error
even when the inferences have validity greater than chance" (ref. 22, p. 181).

"Recognizing an optimum degree of differentiation makes it necessary to
re-examine and qualify statements commonly made in training teachers, to the
effect that every pupil has his own pattern and the teacher must fit methods
to that pattern, not treat the pupil in terms of the statistical average.
...the teacher who is poorly informed regarding the unique patterns of his
pupils should probably treat them by a standard pattern of instruction,
carefully fitted to the typical pupil. Modifying plans drastically on the basis
of limited diagnostic information may do harm" (ref. 22, p. 1^).

While adaptive programming may have potential benefits, the programmer
must realize that just as programmed instruction is not necessarily superior
to "conventional" instruction, adaptive programming is not necessarily
superior to linear programming. With this warning in mind, let us turn to
examine some types of adaptive programming.

Fixed-Treatment Placement

The first major type we will consider is that in which on the basis of
some pretest, learners are assigned to different but fixed sequences of
material. Following Cronbach and Gleser (ref. 23), we will refer to this
type of adaptive programming as fixed-treatment placement. In
fixed-treatment placement the programmer prepares different fixed sequences of material,
and believes he can identify "types" of learners who will learn most
efficiently with each of these sequences.

Under what circumstances would this type of adaptive programming pay
off? We can say that it will pay off when the assigning of different learners
to different sequences of material results in more efficient learning than
would be obtained by assigning all learners to the best single sequence of
material. Suppose, for example, the programmer wants to classify all learners
as either Type A or Type B, and then give all Type A learners Fixed Sequence
of Material I and all Type B learners Fixed Sequence of Material II. It may
help the reader to think of Type A and Type B as high and low aptitude people,
and Sequences I and II as "large step" and "small step" programs respectively,
although the analysis given here will be more generally applicable.

In order to assess whether this adaptive programming pays off, the
programmer would first identify who the Type A and Type B people are, then
expose randomly chosen subgroups of Type A and Type B people to Sequences I
and II. Suppose he found the type of interaction that is shown in Table 1,
that is, an interaction in which Type A people learn more efficiently with
Sequence I and Type B people learn more efficiently with Sequence II. Then,
if the testing and administrative costs were smaller than the savings he would
achieve, he would give Sequence I to Type A people and Sequence II to Type B
people when he put the program into operational use. On the other hand, he
might find no interaction, or an interaction such as is shown in Table 2, in
which both Type A and Type B people learn more efficiently with Sequence I.
In such a case, he would give Sequence I to all people when he put the
program into operational use.
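
The comparison the programmer must make can be put in numerical form. The
sketch below is an illustration under stated assumptions: it uses the
efficiency values of Tables 1 and 2, assumes (as the tables themselves do
not say) that Type A and Type B people are equally numerous, and uses
function names of our own. It computes the gain from fixed-treatment
placement over the best single sequence; that gain would still have to
exceed the testing and administrative costs.

```python
# Learning efficiency by type of person and sequence (Tables 1 and 2).
TABLE_1 = {"A": {"I": 100, "II": 70}, "B": {"I": 80, "II": 90}}
TABLE_2 = {"A": {"I": 100, "II": 70}, "B": {"I": 90, "II": 80}}

# Assumed proportions of each type in the learner population.
EQUAL_SHARES = {"A": 0.5, "B": 0.5}


def best_single_sequence(table, shares):
    """Sequence with the highest mean efficiency when given to everyone."""
    sequences = list(next(iter(table.values())))

    def mean_efficiency(seq):
        return sum(shares[t] * table[t][seq] for t in table)

    best = max(sequences, key=mean_efficiency)
    return best, mean_efficiency(best)


def placement_mean(table, shares):
    """Mean efficiency when each type gets the sequence that suits it best."""
    return sum(shares[t] * max(table[t].values()) for t in table)


def placement_gain(table, shares):
    """Gain of fixed-treatment placement over the best single sequence."""
    _, single = best_single_sequence(table, shares)
    return placement_mean(table, shares) - single


gain_table_1 = placement_gain(TABLE_1, EQUAL_SHARES)  # 95 versus 90
gain_table_2 = placement_gain(TABLE_2, EQUAL_SHARES)  # 95 versus 95
```

With the Table 1 values placement gains five efficiency units over giving
everyone Sequence I; with the Table 2 values it gains nothing, so the cost
of testing would be wasted.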

In this procedure for determining the payoff from fixed-treatment
placement, the programmer must assign half of the learners to a treatment which
he thinks is less than optimal. He must keep in mind that the primary purpose
of this apparently inefficient procedure is to find out how valid his test is
for fixed-treatment placement; it is not to train learners. The test
constructor must also use an apparently inefficient procedure in validating a
new selection test when he accepts all the applicants: his primary purpose
is to validate the test; it is not to discriminate among the applicants.

Can the programmer hope to find interactions of the type shown in Table 1?
Stolurow (ref. 127) has summarized much of the experimental literature on
human learning and concludes "The studies have provided few specific
interaction effects between learner variables and methods variables..."

In a recent study of "adaptive" training procedures Cline, Beals and
Seidman (ref. 16) showed that on the basis of aptitude test scores, trainees
in a military setting could be assigned to different training sequences aimed
at the same goal, with the result being a more efficient operation than if
all trainees had been put through the "standard" sequence.

In an auto-instructional setting, however, Shay (ref. 120) found no
interaction between three levels of student aptitude and three programs differing
in number of frames and "difficulty level." Shay himself points to a number
of aspects of his procedure which may have reduced his chances for finding
such an interaction. His measures of student aptitude were taken from
already available scores on the "...Kuhlmann-Anderson, Detroit Primary, and
Public School Primary tests in all but the few cases where recent Binet IQ's
were available" (pp. 37-38). He admits "...the possibility that the IQ's for
the several tests used were not commensurate and may have obscured any
real relationship that existed" (p. 59). Another consideration is that
because of machine defects, many students were given ambiguous knowledge
of results on at least one frame. Finally, as we saw in the last chapter
under the topic of pretesting and revision, some students confused the
numerical labels. For these reasons, we should be reluctant to conclude
that one could not find the desired interaction using Shay's programs and
student populations.

                    Table 1

        An Interaction Which Is Useful
        for Fixed-Treatment Placement (a)

                        TYPE OF PERSON
                          A         B

    SEQUENCE     I       100        80
                 II       70        90

                    Table 2

        An Interaction Which Is Not Useful
        for Fixed-Treatment Placement (a)

                        TYPE OF PERSON
                          A         B

    SEQUENCE     I       100        90
                 II       70        80

(a) Higher numbers denote greater learning efficiency.

Regardless of what Shay found and how one might choose to interpret
his finding, we must, of course, decide in each training situation separately
whether it is worthwhile to use a fixed-treatment placement procedure. We
may expect this type of programming to pay off when we have some insight
into a good choice of tests for differentiating among types of people and
among programs to use.

At present, programmers have chiefly, if not exclusively, concerned
themselves with measures of general aptitude as a basis for differentiating
learners for this type of adaptive programming. Stolurow (ref. 127, p. 59)
has pointed out, however, that the "...available research on the relationships
between the learner's ability and his gains in learning do not justify the
assumption that different programs have to be written for high and low
ability groups."

Three general conclusions seem to emerge from the research relating
aptitude to learning: (1) Aptitude is positively related to learning;
(2) Aptitude is not related to learning; (3) Aptitude is negatively related
to learning. Among the possible sources of contradiction in this research
are the use of different intelligence measures, the use of different types of
learning scores (gain scores, final achievement scores, time per unit scores,
units per time scores, number correct scores, etc.), different degrees of
experimental control over data collection (E paced, S paced; laboratory setting,
school setting), different aptitude measures, and different types of learning
tasks (verbal, psychomotor, etc.). Even if the available evidence consistently
showed high positive correlations between aptitude and learning measures, this
should not lead the programmer to use measures of general aptitude for
fixed-treatment placement, since he is primarily interested in the differential
payoff from various treatments, rather than in predicting achievement within
one treatment. "General mental ability...is likely to be correlated with
success in mathematics no matter how the subject is taught. If the
alternative teaching procedures are an abstract deductive method and an applied
inductive method, the bright students should do better with either approach.
...On the other hand, there may be other qualities of the individual (say,
interest in abstract problems, or liking for rigorous reasoning) which would
have quite different relations to the two treatments. A measure which
predicted success under one treatment and not the other would be a much better
aid to placement than a measure which predicts both" (ref. 23, p. 68).

Stolurow (ref. 127, p. 51ff) provides a good discussion of the "qualities
of the individual" suggested by the literature on human learning. A number
of rather recent studies suggest some additional qualities of this kind.

Allison (ref. 2) reports that "Measures of learning and measures of
aptitude and achievement, which have generally been treated experimentally
as separate entities, have factors in common with each other." The seven
interpretable learning factors he found were Verbal Conceptual Learning,
Spatial Conceptual Learning, Mechanical-Motor Learning, three Rote Learning
factors, and an "Early vs. Late" learning factor. In some cases it may be
possible to alter a training situation which involves primarily
mechanical-motor learning so as to involve spatial-conceptual learning. Then trainees
with high mechanical-motor or spatial-conceptual factor scores could be
assigned to the more appropriate version of the training situation.

Bruner (ref. 13) reports that subjects who are provided with material
to learn which contains their own preferred type of mediator (thematic,
generic, or part-whole) remember better than subjects who are not given
their preferred type of mediator.

Messick and Hills (ref. 98) have shown that there are characteristic
individual differences in the amount of information needed to make "inductive
leaps." This may be important for programs which require the student to
induce rules.

Jenkins (ref. 78) found that subjects who give common word associations
learn lists of high and medium built-in associations faster than subjects
who give uncommon associations, but they learn lists with low built-in
associations more slowly.

Finally, we should note the recent work in testing on "moderator" 
variables (e.g., refs. 46, 11?). In this work a preliminary test is 
used to classify subjects into two or more groups. For each group the 
further tests to be given can be weighted so as to provide maximum 
validity for that group. Frederiksen and Gilbert (ref. 46), for example, 
first classified engineering students as being either high or low in interest 
in accounting. They found that a measure of interest in engineering could 
better predict grades for those who had a low interest in accounting than for 
those who had a high interest in accounting. Further work in this area 
might suggest to the programmer what test variables would be useful in 
fixed-treatment placement. 
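
As a rough illustration of the moderator idea, the sketch below splits 
subjects on a preliminary (moderator) score and computes the correlation 
between a predictor and a criterion separately for each group. The function 
names, the cutoff, and the data are invented for illustration; they are not 
taken from the Frederiksen and Gilbert study. 

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def validity_by_group(records, cutoff):
    """Split records on the moderator score, then correlate predictor with
    criterion within each group. Each record is a hypothetical triple
    (moderator, predictor, criterion)."""
    groups = {"low": [], "high": []}
    for moderator, predictor, criterion in records:
        key = "high" if moderator >= cutoff else "low"
        groups[key].append((predictor, criterion))
    return {
        name: pearson([p for p, _ in rows], [c for _, c in rows])
        for name, rows in groups.items()
    }

# Invented data: the predictor tracks the criterion only in the low group.
records = [
    (1, 1, 1), (2, 2, 2), (1, 3, 3), (2, 4, 4),
    (8, 1, 3), (9, 2, 1), (8, 3, 4), (9, 4, 2),
]
validities = validity_by_group(records, cutoff=5)
```

With data of this kind the predictor would be weighted heavily for the low 
group and lightly, or not at all, for the high group. 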

Branching Programs 

Now we turn to the second major type of adaptive programming, the 
branching program. In this type of program the material presented to the 
learner is always contingent upon his response to the previous training 
material. The reader will recall that in fixed-treatment placement the 
learner is assigned just once to a fixed sequence of frames on the basis of 
his responses on a pretest; in a branching program, the learner is 
periodically reassigned during training on the basis of his responses during 
training. These reassignments may occur as frequently as once per frame. 

In the branching program as it has been developed by Crowder, the burden 
of instruction is placed upon relatively lengthy expository material. This 
material is then followed by what is essentially a test item, to determine 
whether the learner has grasped the point of the expository material and can 
proceed, or has failed to grasp the point and must receive some remedial 
material. Following any necessary remedial material, the learner is returned 
to the missed item to attempt it again. 

The branching program is sometimes spoken of as involving a "two-way 
interaction between instructor and student," a "communication process," or 
a "closed loop," as contrasted with an "open loop" Skinner-type linear 
program. This language is used to stress the fact that in a branching 
program not only does the learner get feedback (knowledge of results) from 
the programmer, but the programmer also gets feedback from the learner. 
Perhaps this point is overemphasized. In both a branching program and a linear 
program the programmer gets feedback from the learner; in a Skinner-type 
linear program the programmer uses the feedback during a tryout stage to 
minimize errors during training, while in a branching program the programmer 
uses the feedback during training to determine what material the learner 
should be exposed to next. 
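
The Crowder-type cycle described above (expository page, test item, remedial 
page for a wrong choice, return to the missed item) can be sketched in a few 
lines. This is only a minimal sketch: the dictionary layout, the key names, 
and the page numbers are hypothetical and are not drawn from any actual 
scrambled book. 

```python
def run_frame(frame, choices):
    """Trace the pages shown while a learner works one branching frame.

    frame   -- dict with hypothetical keys: 'page', 'correct', 'next_page',
               and 'remedial' (maps each wrong alternative to a remedial page)
    choices -- the learner's successive choices on this frame
    """
    shown = [frame["page"]]
    for choice in choices:
        if choice == frame["correct"]:
            shown.append(frame["next_page"])     # point grasped: proceed
            break
        shown.append(frame["remedial"][choice])  # remedial page for this error
        shown.append(frame["page"])              # return to the missed item
    return shown

frame = {"page": 1, "correct": "b", "next_page": 5,
         "remedial": {"a": 3, "c": 4}}
print(run_frame(frame, ["a", "b"]))   # pages shown: [1, 3, 1, 5]
```

Note that the programmer's feedback from the learner enters only through the 
choice of remedial page; everything else about the path is fixed in advance. 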

In testing, some work has been done on making the item given to the 
examinee next depend on how he responds to the preceding item. Hutt (ref. 7^) 
studied the effectiveness of a branching testing technique with the 
Stanford-Binet. For an "Adaptive" group of Ss, failure on a given item meant 
that a relatively easy item was given next, so that failure on successive 
items was rare. Among poorly adjusted Ss, the adaptive group achieved higher 
scores than a group administered the test under standard conditions. 

Krathwohl and Huyser (ref. 89) and Bayroff, Thomas, and Anderson (ref. 6) 
have also developed branching tests. So far no generalizations that might 
be useful to the programmer have emerged from this work in adaptive testing. 

How can the programmer determine whether a branching program pays off? 
Once again, the basic procedure is to compare the costs and benefits of 
using a branching program with the costs and benefits of using the best 
available alternate procedure, which presumably will be a linear program. 

Silberman et al. (ref. 121) have suggested a refinement of this procedure 
in order to determine if any instructional effect which may be attributed to 
branching is due to the diagnostic-remedial effect of branching or merely 
to the extra material a group given a branching program gets. Branching had 
no diagnostic-remedial effect in the particular program with which they 
worked. 

In another study Coulson and Silberman (ref. 20) found no difference between 
branching and nonbranching groups on a posttest dealing with the elementary 
psychology subject matter of the program. Just as we cannot conclude from 
Shay's work that fixed-treatment placement will never be useful, we cannot, 
of course, assume from this finding that branching will never be useful. 
Whether a branching program will be successful will obviously depend, among 
other things, on the programmer's skill in drawing inferences from the 
learner's wrong responses. In a Crowder-type branching program, the 
programmer attempts to infer why a particular error was made, so that the 
underlying misconception or faulty process can be cleared up. In this approach 
the programmer assumes that the particular errors learners make convey some 
information, that is, that different wrong responses reflect different 
processes in the learner. He further assumes that to some extent learners are 
similar in the misconceptions they initially hold or develop while going 
through the program, and in the way in which their misconceptions manifest 
themselves in errors. 

Are these assumptions justified? We turn to the testing literature 
for an answer. From his clinical experience, Rapaport states that "...many 
of the intelligence test responses are highly conventionalized, and that a 
subject knows who was President of the United States before Roosevelt merely 
adds to his general score. But where the response deviates from the 
conventional, the deviation does not merely fail to add to his score; it must 
also be considered as a characteristic which may give us material toward the 
understanding of the subject" (ref. 109, p. ^O). 

Davis and Fifer (ref. 29) found that the particular wrong responses 
made by examinees did convey information; when scoring weights were developed 
for the misleads in multiple-choice items, the predictive power of the test 
was increased. The programmer, of course, is not interested in measuring the 
learner's aptitude from his errors but rather in inferring the processes 
which lead to his errors. Can the information conveyed by wrong responses be 
used for this purpose? 

Findley and Read (in ref. 9) showed that errors made in answering 
arithmetic questions may be classified in a way which shows differences in 
mean total test score among the examinees making errors in different 
categories. Apparently, then, the errors made by different examinees on 
different questions can be categorized in a meaningful way. From the nature 
of the categories (e.g., "interchanging the unknown with whatever lies on 
the opposite side of the equation," "an error resulting from the confusion 
of division with subtraction") we may assume that they reflect different 
processes in the examinees. 
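
The kind of tabulation this implies is simple: group the observed errors by 
category and compare the mean total test score of the examinees in each 
category. The sketch below assumes each error has already been assigned to a 
category by hand; the shortened category labels and the scores are invented 
for illustration. 

```python
from collections import defaultdict

def mean_total_by_category(errors):
    """errors: iterable of (error_category, examinee_total_score) pairs.
    Returns the mean total test score for each error category."""
    by_category = defaultdict(list)
    for category, total_score in errors:
        by_category[category].append(total_score)
    return {c: sum(v) / len(v) for c, v in by_category.items()}

# Invented data for two hand-assigned error categories.
errors = [
    ("interchanged unknown", 62),
    ("interchanged unknown", 58),
    ("division/subtraction confusion", 41),
    ("division/subtraction confusion", 45),
]
means = mean_total_by_category(errors)
```

A large gap between category means would suggest, though it would not prove, 
that the categories separate examinees who are working in different ways. 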

Some other studies which also deal with arithmetic errors suggest that 
the inferences one can draw from individual errors may be rather limited. 
Grossnickle (refs. 62, 63) found that in general students were not consistent 
in the types of errors they made in division. In one of his studies only 
four of the 21 types of errors made were at all persistent. 

Brueckner and Elwell (ref. 12) studied errors made in the multiplication 
of fractions. They found that a student who made an error on one example 
did not necessarily make errors on similar examples, and, if he did, the 
errors were often not of the same type. "The pupil should be given at least 
three or four opportunities to solve examples of one type since single errors 
may be largely chance or accidental" (p. 177). 

In both the Grossnickle and the Brueckner and Elwell studies the 
experimenters assumed that when a student makes an error on one example and 
does not make the same type of error on other examples which afford him the 
opportunity to do so, the error made was a "chance" error. An alternate 
explanation is that the experimenters were not classifying errors into "types" 
in the most fruitful way and that some other classification schemes would 
show that certain types of errors did occur consistently across samples of 
behavior. In any event, Grossnickle and Brueckner and Elwell presented their 
problems in formats which may have emphasized rapid, relatively "mechanical" 
work. Under such circumstances what we may call "chance" errors may result.* 
We should keep in mind, of course, that the "chance" error category is a 
residual category for left-over errors which may in the future be better 
classified in yet-to-be-developed error classification schemes. 



* This possibility was pointed out by Leighton Price in a personal 
communication. 

The analysis of data in laboratory studies of human learning has 
also sometimes involved the classification of errors (e.g., ref. 50). Recently 
Cook (ref. 18, p. 2) has developed a generalized scheme for classifying 
errors which occur in paired-associate learning experiments. He makes a 
basic distinction between a legitimate response ("a response which is 
any one of the... response terms in the experiment, whether this response 
has been elicited by its proper stimulus term or by some other"), an 
extraneous response (any response other than a legitimate response), and 
an omission (no response). He then further subdivides the legitimate and 
extraneous response classes. 

The scheme may not be as content-free as it appears. Cook assumes that 
the set of legitimate responses is part of a larger class of responses. He, 
therefore, subdivides the extraneous responses into those which are and those 
which are not members of this larger class. For example, if A, B, C, D, and 
E constitute the set of legitimate responses, he would consider "Q" to come 
from the same larger class that the legitimate responses A, B, C, D, and E 
come from, and he would consider "3" not to come from that class. But one 
could conceive of the legitimate responses A, B, C, D, and E as coming from 
the class of the first 13 letters of the alphabet, rather than of the entire 
alphabet. In that case "Q" would not come from the same class as the 
legitimate responses come from. 
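
The dependence of the classification on the assumed larger class can be made 
concrete with a small sketch. The function and the category labels below are 
our own shorthand, not Cook's terminology, and the further subdivision of 
legitimate responses is omitted. 

```python
import string

def classify_response(response, legitimate, larger_class):
    """Classify one paired-associate response along the lines of the scheme.
    `larger_class` is whatever class the experimenter assumes the
    legitimate responses belong to."""
    if response is None or response == "":
        return "omission"
    if response in legitimate:
        return "legitimate"
    if response in larger_class:
        return "extraneous, within larger class"
    return "extraneous, outside larger class"

legitimate = set("ABCDE")
whole_alphabet = set(string.ascii_uppercase)       # A through Z
first_13_letters = set(string.ascii_uppercase[:13])  # A through M

# The same response "Q" is classified differently under the two assumptions.
under_alphabet = classify_response("Q", legitimate, whole_alphabet)
under_first_13 = classify_response("Q", legitimate, first_13_letters)
```

Here `under_alphabet` is "extraneous, within larger class" while 
`under_first_13` is "extraneous, outside larger class", although the learner's 
behavior is identical in the two cases. 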

It seems that the experimenter must have some intuitive idea of the 
larger class to which the legitimate responses "belong" in order to use Cook's 
scheme. Presumably this intuitive idea would come from his knowledge of the 
mediational processes common to his learners. 

As it now stands, Cook has found the scheme useful in reanalyzing the 
data of Kopstein and Roshal (ref. 87) and of others. The scheme could be 
useful to auto-instructional programmers if remedial actions were specified 
for learners who make responses falling into the various categories. Up to 
now, adaptive programming in the learning of paired-associates has been 
limited to the dropping of pairs from a list when the learner's responses 
indicated they were mastered (e.g., ref. ll^f). 

In general, if different wrong responses do convey information concerning 
the particular misconceptions held by the learner, how can the programmer use 
this information to provide the learner with remedial material? The 
programmer's task is to determine what errors are commonly made, and what 
misconceptions or faulty thought processes they reflect. In some cases the 
programmer may be sufficiently familiar with the subject matter and with the 
characteristics of the learners for him to know what errors are commonly made 
and what they reflect. In other cases he can determine what errors are 
commonly made by pretesting his frames in constructed-response format. He 
would then use the commonly given wrong responses as alternatives in the 
multiple-choice format of the program. This procedure is sometimes useful to 
the test constructor (see Adkins and Toops, ref. 1) who selects as 
multiple-choice alternatives those answers which not only are rather commonly 
given but also which tend to be given by examinees with lower total test 
scores. Research is needed to determine whether the mean aptitude or 
variability of aptitude of those learners choosing a particular 
multiple-choice alternative is related to the usefulness of providing remedial 
material for that alternative. 
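
A minimal sketch of the tabulation involved follows. It assumes that each 
pretest record pairs a learner's constructed response with his total test 
score; the function name and the data (responses to a binary-arithmetic frame 
whose correct answer is "10") are invented for illustration. 

```python
from collections import Counter, defaultdict

def common_wrong_responses(pretest, correct, k=2):
    """pretest: list of (constructed_response, total_test_score) pairs.
    Returns the k most frequent wrong responses as tuples of
    (response, frequency, mean total score of those who gave it)."""
    scores = defaultdict(list)
    for response, total_score in pretest:
        if response != correct:
            scores[response].append(total_score)
    counts = Counter({r: len(s) for r, s in scores.items()})
    return [(r, n, sum(scores[r]) / n) for r, n in counts.most_common(k)]

# Invented pretest data for the frame "One plus one equals ___" in binary.
pretest = [("2", 40), ("10", 80), ("2", 45), ("11", 70),
           ("2", 50), ("10", 90), ("11", 60), ("1", 30)]
alternatives = common_wrong_responses(pretest, correct="10", k=2)
```

In this invented case "2" is both the most common error and the one given by 
the lower-scoring learners, so it would be the first candidate for a 
multiple-choice alternative with its own remedial material. 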

The programmer who has determined what the commonly given wrong responses 
are must then find out what they reflect. Buswell and John (ref. 1^) found 
that they could get at the mental processes that children go through in 
doing arithmetic problems by interviewing them and asking them to "think 
aloud." Pressey and Campbell (ref. 107) report that interviews were useful 
in getting at the reasons for spelling errors in the capitalization of words. 
The programmer may also find interviews useful in finding out what wrong 
responses reflect. 

The programmer who has identified common errors, and gone on to determine 
their source and to provide remedial material, has done the necessary 
preliminary work in writing a branching program. He can now go on to the 
writing, trying out, and revising of branching material. 

Measuring Achievement in a Branching Program 

The programmer might do well to make branching in his program contingent 
upon the learner's responses to several frames, rather than to a single frame. 
This recommendation is in line with Brueckner and Elwell's previously cited 
finding, and also Gilbert's suggestions (ref. 56). A further rationale for 
this recommendation is that in most cases it is impossible to adequately 
sample a domain with a single test item. "An item with a validity coefficient 
as high as 0.25 or 0.30 usually represents an outstandingly valid item" 
(ref. 131, p. 245). Finally, Crowder (ref. 26) has recently pointed out that 
the relatively inexpensive scrambled book can be used even when the programmer 
wants to make branching contingent upon the learner's responses to several 
frames rather than to a single frame. 

In some existing branching programs (e.g., ref. 25), the learner is 
given remedial information when he gives a wrong response to a frame and then 
he is returned to the frame he missed. The frame is now called upon again 
to serve as a one-item achievement test. Where initially this frame may have 
been inadequate for this purpose, it is now even more inadequate, since the 
learner may be able to remember and reject the previously made incorrect 
response. This would increase the likelihood that he will choose the correct 
answer merely by guessing. 
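
The size of this effect is easy to work out under a simple assumption: the 
learner guesses at random among the alternatives he has not already tried and 
rejected. The helper below is a sketch under that assumption only; real 
learners need not guess uniformly. 

```python
from fractions import Fraction

def chance_of_correct_guess(n_alternatives, n_rejected=0):
    """Probability of guessing the one correct alternative when
    `n_rejected` wrong alternatives are remembered and avoided."""
    remaining = n_alternatives - n_rejected
    if remaining < 1:
        raise ValueError("at least one alternative must remain")
    return Fraction(1, remaining)

first_try = chance_of_correct_guess(3)                  # 1/3
second_try = chance_of_correct_guess(3, n_rejected=1)   # 1/2
```

For a three-alternative frame, then, the chance of a correct blind guess 
rises from one in three on the first attempt to one in two on the return 
visit, and to certainty on a third attempt. 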

In addition to using responses to several frames as a basis for branching, 
the programmer may use "alternate forms" of the branching frames to help 
resolve this difficulty. After he receives remedial information, the learner 
should be presented with frames which cover the same concept as was covered 
by the previous frames used for branching, but in which some specifics are 
altered. Again, if the programmer uses several alternate form frames he would 
have a better basis for further branching. 

Consider this example: After presenting a definition of "factors," 
Crowder (ref. 25, p. 14) asks, "Which of the sets of numbers below are the 
factors of the number 15? 

          3 and 4          Page 19 
          3 and 5          Page 25 
          2 and 13         Page 31." 

An alternate form criterion frame which reads as follows might be given 
to the learner who has turned to page 19 or 31 and received remedial material: 

     Which of the sets of numbers below are the factors of the number 21? 

          3 and 7          Page 6 
          5 and 7          Page 9 
          7 and 14         Page 10. 



The programmer could also use the alternate form frame technique in 
frame revision. If, following remedial information, the learners do not 
choose the correct answers on the alternate form frames substantially more 
often than chance would allow, the programmer should check over his diagnosis 
of the original error and the remedial information he has provided, since 
revision is indicated. 
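
One way to give "substantially more often than chance" a working meaning is 
an exact binomial tail probability. The sketch below assumes independent 
learners who, if the remedial material taught nothing, would each pick the 
correct alternative with probability 1/3 on a three-alternative frame; the 
counts in the example are invented. 

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance that k or more of n
    learners answer correctly by guessing alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented tryout: 20 learners see the alternate form frame after remediation.
p_chance = 1 / 3
tail_12_correct = binomial_tail(20, 12, p_chance)  # 12 of 20 correct
tail_7_correct = binomial_tail(20, 7, p_chance)    # 7 of 20 correct
```

A small tail probability (as for 12 of 20 correct) suggests the remedial 
sequence is doing more than guessing would; a large one (as for 7 of 20) 
suggests the diagnosis or the remedial material should be reviewed. 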



It has probably occurred to the reader that when the learner in a 
branching program guesses, the potential advantages of a branching program 
over a linear program are reduced. The recommendation to use responses to 
several frames as a basis for branching will help minimize the effects of 
guessing. A second way the programmer might minimize these effects might 
be to instruct the learners not to guess. Swineford and Miller (ref. 128) 
studied the effects of such instructions in a vocabulary testing situation. 
They found that instructing the subjects not to guess reduced but did not 
completely eliminate guessing. They measured guessing by the number of times 
the subjects attempted to provide synonyms for nonsense words. 

Their technique of measuring guessing suggests a way to train learners 
not to guess: Learners could be given nonsense frames for which all 
alternatives lead to negative knowledge of results ("wrong"). These frames 
would be interspersed with legitimate frames. The nonreinforcement of guessing 
on the nonsense frames would be expected to decrease the future occurrence of 
guessing. Successful guessing on legitimate frames would mean that guessing 
on frames-in-general would be intermittently reinforced, but this intermittent 
reinforcement is always present. The nonsense frames would serve to decrease 
the percentage of times that guessing is reinforced. 
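
The arithmetic behind this last point can be sketched as follows, assuming 
(for illustration only) that the learner guesses blindly on every frame and 
that each legitimate frame has exactly one correct alternative among 
equally attractive choices. 

```python
from fractions import Fraction

def reinforced_guess_rate(n_legitimate, n_nonsense, n_alternatives):
    """Expected share of blind guesses that are reinforced ("right").
    Guesses on nonsense frames are never reinforced; guesses on
    legitimate frames are reinforced 1 time in n_alternatives."""
    total_frames = n_legitimate + n_nonsense
    return Fraction(n_legitimate, n_alternatives * total_frames)

without_nonsense = reinforced_guess_rate(30, 0, 3)   # 1/3
with_nonsense = reinforced_guess_rate(30, 10, 3)     # 1/4
```

Under these assumptions, adding 10 nonsense frames to 30 legitimate 
three-alternative frames lowers the reinforcement rate for guessing from 
one in three to one in four. 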

The reduction of guessing through instruction or through training can 
only be effective, of course, if the learner who realizes he has no idea of 
the correct answer to a frame is given the opportunity to say so. The 
programmer can provide this opportunity by using "I don't know" as a response 
alternative. A "don't know" response may also provide useful information 
during tryout and revision. If it tends to be chosen by learners of higher 
aptitude, a subtle ambiguity may be present in the frames. 

Up to now we have confined ourselves to considering only the particular 
errors made by learners as a basis for branching. Now we will turn to a 
theoretical formulation which will lead us to consider the use of response 
time as a basis for branching. 

Response Time and Branching 

Amsel (ref. 4) has provided a theoretical framework for identifying 
situations in which it may be quite desirable to have errors committed during 
training. When the learner starts out with a strong (superthreshold) correct 
response tendency and a strong (superthreshold) incorrect response tendency, 
the programmer who merely elicits and rewards the correct response may not 
produce a sufficiently great difference in strength between the correct and 
incorrect response tendencies. Amsel hypothesizes that in these situations 
the programmer must also elicit (but not reward) the incorrect response in 
order to be certain that the incorrect response will not be given after 
training is complete. It would seem that Amsel's analysis, if valid, would 
apply more generally to all situations in which there is initially a 
superthreshold incorrect response tendency, whether or not there is also a 
superthreshold correct response tendency. 

How could a programmer detect these situations? In some cases, a 
superthreshold incorrect response tendency may obviously be present, e.g., in 
the context of learning binary arithmetic, the tendency to respond "2" to the 
frame "One plus one equals ___?" In less obvious cases, the programmer 
might detect such tendencies by pretesting frames in a constructed response 
format on a population comparable to the learner population. This procedure 
might be useful in detecting tendencies which were common to many of the 
learners. The programmer would then write into the program several frames 
designed to elicit and not reward the commonly given incorrect response. 

The programmer could also deal with superthreshold incorrect response 
tendencies which are idiosyncratic, but this would, of course, require a 
branching program. It would also require a more flexible teaching device. 
Specifically, the teaching device would have to be sensitive to the "response 
latency," that is, the time interval between the presentation of the frame 
and the learner's response to it. Any response made with a latency shorter 
than a given value could be considered to indicate a superthreshold tendency. 
The teaching device might be set up to repeatedly expose the learner to any 
item on which this occurs but to only repeat once an item on which an 
incorrect response with a longer latency was made, or on which no response was 
made. 
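
The decision rule just described can be sketched as a small function. The 
number of repetitions for a fast incorrect response (three here) and the 
treatment of correct responses (no repetition) are our own assumptions; the 
text above specifies only "repeatedly" versus "once." 

```python
def repetitions_for_item(correct, latency, critical_latency, fast_repeats=3):
    """How many times to re-present an item after the learner's response.

    correct          -- True if the response was correct
    latency          -- seconds from frame presentation to response,
                        or None if no response was made
    critical_latency -- latencies below this are taken to indicate a
                        superthreshold response tendency
    fast_repeats     -- assumed repetition count for fast incorrect responses
    """
    if correct:
        return 0                      # assumption: correct items are not repeated
    if latency is not None and latency < critical_latency:
        return fast_repeats           # fast error: strong incorrect tendency
    return 1                          # slow error or omission: repeat once

fast_error = repetitions_for_item(False, latency=0.8, critical_latency=2.0)
slow_error = repetitions_for_item(False, latency=5.0, critical_latency=2.0)
omission = repetitions_for_item(False, latency=None, critical_latency=2.0)
```

The repeated presentations following a fast error are the occasions on which 
the incorrect response is elicited and not rewarded. 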

There is some evidence which suggests that if the above procedure were 
used, then the critical response latency should be set at different points 
for different learners. Tate (ref. 130) gave arithmetic reasoning, number 
series, sentence completion, and spatial relations test items at each of 
three difficulty levels to a group of subjects. He found that each S had a 
characteristic speed of response which was relatively independent of the 
subject matter of the item, the difficulty level of the item for the group, 
and whether or not he got the item right. Research is needed to explore the 
diagnostic value of response latency information in general and the 
usefulness of adjusting the critical latency to the individual. 
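
One purely illustrative way to adjust the critical latency to the individual 
would be to set it at some fraction of the learner's own typical latency on 
earlier frames. The choice of the median and of the fraction one-half below 
is an assumption made for the sketch, not a finding from Tate's study. 

```python
from statistics import median

def personal_critical_latency(past_latencies, fraction=0.5):
    """Critical latency for one learner, taken as a fraction of the
    median of his own earlier response latencies (in seconds)."""
    if not past_latencies:
        raise ValueError("need at least one earlier latency")
    return fraction * median(past_latencies)

quick_learner = personal_critical_latency([1.0, 2.0, 3.0])    # 1.0 second
slow_learner = personal_critical_latency([4.0, 6.0, 8.0])     # 3.0 seconds
```

A response made in two seconds would then count as "fast" for the 
characteristically slow learner but not for the quick one. 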

Finally, in addition to branching on the basis of the particular error 
made and on the basis of time taken to make an error, there is the possibility 
of branching at the discretion of the learner. Silberman et al. (ref. 121) 
found that a branching program was not superior to a fixed (linear) program 
when the conditions for branching were prescribed by E; but a program in 
which S had the option of branching was superior to a fixed program. At the 
moment there appear to be no specific applications of testing considerations 
to this type of branching. 

Section 5. Summary 



This report is concerned with the implications of testing for 
auto-instructional programming. 

In Section 1 we saw that tests were used for predicting behavior and 
programs were used for modifying behavior. We noted that in spite of this 
difference in purposes, the steps one goes through in developing a test and 
in developing a program were similar. We further noted two relationships 
between testing and programming: (1) Tests are used in the evaluation of 
programs; (2) Tests are used in adaptive programming, in which different 
learners are assigned to different sequences of material, to differentiate 
among the learners. 

In Section 2 we briefly went over the steps in the construction of a 
program. These steps are Specifying Objectives, Determining the Resources 
Available, Planning and Developing Frames, Pretesting and Revision, 
Evaluation, and Providing Information to Program Users. 

Section 3 was organized around a more extensive discussion of the 
steps in the construction of a test and some implications for programming 
that emerge at each of these steps. In both testing and programming the 
first step is the specification of objectives, and this ultimately involves 
a choice of operationally defined criteria. Various considerations that 
the test constructor takes into account in choosing criteria are relevance, 
possible bias, and reliability. The importance of these considerations for 
the programmer was discussed. The point was made that the programmer was 
apt to look for internal criteria, that is, measures during training of how 
well the learners were doing, and that such criteria may be rather poor. 
The necessity of combining criteria and the dollar criterion technique for 
doing so were both discussed. 

The assessment of the available resources was then discussed, and we 
saw how the significance of this step differed for the test constructor and 
the programmer. Item writing suggestions were examined with the purpose of 
seeing how they might apply to programming. We saw that specifying the 
terms in which a constructed response is to be given, which may facilitate 
scoring in testing, may be more crucial in programming since it may affect 
whether the learner gets knowledge of results. It was suggested that the 
task of classifying test items as to educational objective may be useful 
in the selection and training of programmers. Some research necessary 
before implementing this suggestion was pointed out. 

We went on to consider the differences in selecting test items and 
program frames and the greater importance in programming of further 
pretesting. Under evaluation, we saw a basic similarity of approach: both 
tests and programs are compared as to costs and benefits with the best 
available alternate procedure. In order to carry out this comparison, 
many diverse elements need to be combined, and again the dollar criterion 
is useful. Finally, both the potential test user and the potential program 
user need to know how well a test or program developed for use with one 
population of people will work with a different population. The possible 
use of a procedure analogous to test equating, for estimating how useful 
programs will be for different populations, was discussed. 

Section 4 was concerned with two programming problems which do not 
usually occur in testing, but for which testing considerations are relevant. 

The first problem concerns the optimal ordering of the instructional 
material. It was pointed out that Guttman scaling may be useful if the 
programmer wants to start different learners at different points, in order 
to capitalize on initial differences in their capabilities, and at the same 
time keep the same sequence of instruction. Simplex theory may also prove 
useful in the ordering of instructional materials, but at present it needs 
further development. 

The second problem was that of adaptive programming, the providing of 
different learners with different sequences of material. One type of 
adaptive programming is what the test constructor would call fixed-treatment 
selection. Testing considerations may be used in its evaluation. Possible 
variables which might prove useful in assigning learners to different 
sequences of material in fixed-treatment placement were pointed out. 

Another type of adaptive programming is branching, making the learner's 
sequence of material depend on his responses during training. The tryout of 
frames in the completion format may provide the programmer with information 
as to what errors are commonly made; these can then be used as alternatives 
for the multiple choice format of the branching program. Interviews may be 
helpful in finding out what misconceptions or faulty processes are reflected 
by these errors. Alternate form criterion items may be useful in testing 
the effectiveness of remedial sequences both during revision and during 
operational use of the branching program. Making branching contingent upon 
responses to several frames may increase its effectiveness. Response 
latency may have diagnostic value in a branching program; some research 
possibilities based upon Amsel's theoretical views were pointed out. 